
AI University Agents: Production Agents in 2026

A deep dive into The AI University's 31-agent production system. Covers the blueprint architecture that solved tool compliance, the simulation lab that pre-computes strategies, the knowledge graph that learns from every interaction, and the MiroFish-inspired swarm intelligence upgrades that gave agents multi-dimensional reasoning.

Last updated: 2026-03-21

We run 31 AI agents in production. They send emails, prospect on LinkedIn, red-team each other's strategies, simulate customer responses before reaching real leads, track prediction markets, and file performance reports — all without human intervention.

This is not a demo. It is not a weekend project. It is a production system that has been running autonomously for months, and it has broken in almost every way you can imagine. What follows is what we actually built, what failed, what we changed, and what works now.


The short version

The system has 31 agents organized into four groups: 15 working agents that touch real leads, 10 simulation agents that think and test strategies before anyone acts, 5 ad operations agents, and 1 market intelligence agent. They share a knowledge graph with 1,100+ triples, a talkspace communication layer, an episodic memory system with 92,000+ episodes, and a trust verification gate that prevents simulation outputs from reaching working agents until they are verified.

The biggest lesson: giving an LLM 35 tools and hoping it picks the right ones does not work. We solved this with a blueprint system that moves orchestration into deterministic code. The LLM handles creative subtasks. The code enforces the sequence.


The 31 agents

Working agents (15)

These agents interact with real leads, real email systems, and real data. Every action they take is visible to the outside world.

| Agent | Job | Daily Budget |
| --- | --- | --- |
| Outreach | Sends personalized 3-email sequences to convert visitors into subscribers | 100 |
| Reply Handler | Classifies incoming replies, routes interested leads to follow-up | 60 |
| Onboarding | Guides new subscribers through activation, prevents early churn | 40 |
| Retention | Monitors engagement health, intervenes when churn signals appear | 30 |
| Win-Back | Re-engages churned subscribers with targeted offers | 30 |
| Campaign Manager | Creates hyper-targeted email campaigns using behavioral data | 40 |
| Content Engine | Generates social posts, comparison articles, thought leadership | 40 |
| Growth Analyst | Analyzes traffic patterns, generates ad copy, produces conversion insights | 40 |
| Marketing Strategist | Full-funnel analysis, identifies gaps, creates task backlogs for other agents | 30 |
| Partnership Agent | Finds and pitches resellers, affiliates, creator professors | 30 |
| Competitor Watch | Monitors competitor activity, feeds intelligence to positioning | 30 |
| Competitor Gap | Mines customer complaints, finds positioning opportunities | 30 |
| LinkedIn Prospector | Discovers decision-makers via stealth browser automation | 30 |
| AI Trend Monitor | Tracks AI trends on X, GitHub, Reddit — creates LinkedIn posts | 30 |
| Brain Maintenance | Deduplicates memory, cleans knowledge graph, prunes stale data | 20 |

Simulation agents (10)

These agents never touch real leads. They think, test, discover, and invent — then push findings to working agents through the knowledge graph, talkspaces, and journey plans.

| Agent | Name | Job |
| --- | --- | --- |
| sim-director | Strategy Lab | Pre-computes 3 strategies per lead via Monte Carlo sampling, roleplays as the lead to evaluate each, writes the winner as a journey plan |
| sim-challenger | Devil's Advocate | Red-teams pending emails with adversarial personas (Skeptic, Competitor's Customer, Budget Blocker). Scores vulnerability 0-100. A score above 85 vetoes the email |
| sim-inventor | Tool Builder | Analyzes failure patterns, creates new skill scripts to fix them, finds underused capabilities |
| sim-researcher | Market Intelligence | Builds persistent customer proxies (Stanford Generative Agents pattern) that accumulate memory across runs |
| sim-tracker | Outcome Tracker | Compares predictions vs actual outcomes, updates strategy weights, applies pheromone decay |
| sim-auditor | Trust Gate | Mandatory verification gate between simulation and working agents. Stamps every item with a trust score 0-100 |
| sim-profiler | Behavioral Profiler | Profiles leads using Chase Hughes' Validation Needs, Six-Axis Model, and Authority Style detection |
| sim-persuader | Persuasion Architect | Generates FATE-structured persuasion strategies (Focus, Authority, Tribe, Emotion) per lead |
| sim-objection | Objection Simulator | Roleplays AS the lead using their behavioral profile, stress-tests outreach from 3 adversarial angles |
| sim-calibrator | Behavioral Calibrator | Compares predicted behavioral responses vs actual outcomes, tunes models per ICP segment |

Ad operations agents (5)

| Agent | Job |
| --- | --- |
| Ad Strategist | Cross-platform performance analysis, budget allocation |
| Ad Monitor | Real-time campaign monitoring, anomaly detection |
| Ad Optimizer | Bid and budget adjustments, A/B test winner implementation |
| Creative Analyst | Ad fatigue detection, creative performance scoring |
| Ad Learner | Post-campaign analysis, ROAS pattern extraction, playbook generation |

Market intelligence (1)

| Agent | Job |
| --- | --- |
| Market Oracle | Tracks Polymarket, PredictIt, Kalshi for AI-related prediction markets. Maps crowd-sourced probability signals to ICP sectors, feeds momentum shifts to outreach timing |

The blueprint architecture

This is the most important section. Everything else we built was necessary, but nothing mattered until we solved the tool compliance problem.

The problem: tool paralysis

When you give an LLM 35-40 tools, it does not carefully pick the right ones. It picks the easiest 5 and ignores the rest. We measured this: 5% tool compliance. Mandatory steps — like updating the CRM after sending an email — were skipped 95% of the time.

The knowledge graph stayed empty. The learning store collected no lessons. The episodic memory system existed but nobody wrote to it. The agents were individually capable but collectively useless.

The solution: deterministic orchestration

The core insight came from studying how Stripe's agent framework handles complex workflows: move orchestration out of the LLM and into deterministic code. The LLM becomes a worker that executes creative subtasks. The code enforces the sequence.

A blueprint is a directed graph of nodes. Each node has a type:

  • Deterministic nodes — Fixed code. Zero LLM tokens. Cannot be skipped. Examples: fetch CRM data, run scoring skills, update database records.
  • Agent nodes — Creative LLM work with 5-8 curated tools and a tight turn budget (3-8 turns). Examples: compose an email, analyze competitive positioning.
  • Gate nodes — Checkpoints that verify conditions before proceeding. Examples: is this lead contactable? Did enrichment succeed?
  • Fork nodes — Branching logic based on data. Examples: which email number is this? Is the lead score above 50?

Example: the outreach blueprint

Step 1: Gather Lead Data                    [DETERMINISTIC]
Step 2: Is This Lead Contactable?           [GATE]
Step 3: Enrich Lead                         [DETERMINISTIC]
Step 4: Analyze Engagement Velocity         [DETERMINISTIC — runs skill scripts]
Step 5: Which Email Number?                 [FORK]
Step 6: Plan the Email                      [DETERMINISTIC]
Step 7: Compose the Email                   [AGENT — 8 tools, 5 turns max]
Step 8: Post-Email Updates                  [DETERMINISTIC — CANNOT BE SKIPPED]
Step 9: Post Signal to Talkspace            [DETERMINISTIC]

Eight of the nine steps run as deterministic code: gates and forks are code, not LLM calls. The LLM only fires for step 7 (composing the actual email), with exactly 8 tools and a maximum of 5 turns. Everything else is guaranteed to execute.
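
The step sequence above can be sketched as a minimal blueprint runner. The `Node` shape, the `gate_ok` flag, and the `run_blueprint` loop are illustrative assumptions, not the production API; the point is that routing lives entirely in code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    kind: str                                # "deterministic" | "agent" | "gate" | "fork"
    run: Callable[[dict], dict]              # does the work, returns the updated context
    route: Callable[[dict], Optional[str]]   # picks the next node id (None = done)

def run_blueprint(nodes: dict, start: str, ctx: dict) -> dict:
    node_id: Optional[str] = start
    while node_id is not None:
        node = nodes[node_id]
        ctx = node.run(ctx)
        if node.kind == "gate" and not ctx.get("gate_ok", True):
            break                            # gate failed: nothing downstream fires
        node_id = node.route(ctx)            # routing lives in code, never in the LLM
    return ctx

# Mirrors steps 1-3 of the outreach blueprint: gather, gate, enrich.
nodes = {
    "gather": Node("deterministic",
                   lambda c: {**c, "lead": "jane@acme.com"},
                   lambda c: "contactable"),
    "contactable": Node("gate",
                        lambda c: {**c, "gate_ok": "lead" in c},
                        lambda c: "enrich"),
    "enrich": Node("deterministic",
                   lambda c: {**c, "enriched": True},
                   lambda c: None),
}
result = run_blueprint(nodes, "gather", {})
```

An agent node would look identical from the runner's perspective; it just happens to call an LLM inside `run` with its curated toolset and turn budget.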

The result

| Metric | Before Blueprints | After Blueprints |
| --- | --- | --- |
| Tool compliance | 5% | 100% |
| Token spend per run | ~30 turns, high cost | 3-8 turns per node |
| Knowledge graph writes | Rarely | Every run (deterministic) |
| CRM updates | Skipped 95% | Never skipped |
| Signal posting | Inconsistent | Every run (deterministic) |

How agents communicate

Agents do not share a single context window. They communicate through four mechanisms:

1. Talkspaces

Named channels where agents post scored signals. Each signal has a domain, urgency, confidence, and an actionFor field that targets specific agents. Nine channels: #strategy, #outreach-review, #competitive-intel, #content-collab, #agent-learning, #simulation, #market-intel, #signals, and a general channel.

When an agent starts a run, it reads its subscribed talkspaces. Signals tagged with actionFor: "outreach" appear in the outreach agent's context as ACTIONS FOR YOU. This is how sim-director's journey plans reach the outreach agent without sharing a context window.
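
The signal shape and the read-side filter can be sketched like this; the field names (domain, urgency, confidence, actionFor) follow the article, while the `Signal` dataclass and `actions_for` helper are hypothetical stand-ins for the real schema:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    channel: str
    domain: str
    urgency: float      # 0.0-1.0
    confidence: float   # 0.0-1.0
    action_for: str     # which agent should act on this
    body: str

def actions_for(agent: str, signals: list[Signal]) -> list[Signal]:
    """What an agent sees as ACTIONS FOR YOU at the start of a run."""
    mine = [s for s in signals if s.action_for == agent]
    return sorted(mine, key=lambda s: s.urgency, reverse=True)

feed = [
    Signal("#strategy", "strategy", 0.9, 0.8, "outreach",
           "Journey plan ready for lead 42"),
    Signal("#competitive-intel", "intel", 0.4, 0.7, "positioning",
           "Competitor price change detected"),
]
todo = actions_for("outreach", feed)
```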

2. Knowledge graph

1,100+ triples connecting leads, companies, competitors, segments, strategies, and objections. Every triple has temporal validity fields (validAt, invalidAt, expiredAt), a confidence score, an audit status, and a source agent.

Agents query the graph before making decisions. They write to it after learning something. Brain Maintenance runs deduplication (MinHash, Jaccard >= 0.9), contradiction resolution (newer facts supersede older), and community detection (label propagation clustering).

3. Event bus

Publish-subscribe events between agents. Each agent declares which event types it cares about. When outreach sends an email, it emits a contact-made event. When a lead converts, it emits a lead-converted event. Other agents pick these up on their next run.
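
A minimal sketch of that pattern, assuming a per-agent pending queue drained at the start of each run; the `EventBus` class is illustrative, though the event names match the article:

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.queues = defaultdict(list)   # agent -> pending events
        self.subs = defaultdict(set)      # event type -> subscribed agents

    def subscribe(self, agent: str, event_type: str) -> None:
        self.subs[event_type].add(agent)

    def emit(self, event_type: str, payload: dict) -> None:
        for agent in self.subs[event_type]:
            self.queues[agent].append((event_type, payload))

    def drain(self, agent: str) -> list:
        """Called when the agent's next run begins."""
        events, self.queues[agent] = self.queues[agent], []
        return events

bus = EventBus()
bus.subscribe("retention", "lead-converted")
bus.emit("contact-made", {"lead": "jane@acme.com"})    # no subscriber: dropped
bus.emit("lead-converted", {"lead": "jane@acme.com"})
pending = bus.drain("retention")
```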

4. Episodic memory

92,000+ episodes tracking every tool call, every decision, every outcome. Pattern extraction identifies recurring successes and failures. The learning store aggregates these into segment-level insights — which email tones work for which segments, which send times produce the highest open rates.


The simulation lab

The 10 simulation agents form an autonomous R&D lab that runs continuously alongside the working agents. They do not send emails, write content, or do outreach. They think, test, and discover.

Pre-computed strategies

Every 2 hours, sim-director scans upcoming leads and generates 3 distinct strategy approaches per lead using different persuasion weight combinations:

  1. Authority + Social Proof heavy (case studies, expert positioning)
  2. Scarcity + Commitment heavy (limited spots, progressive commitment)
  3. Reciprocity + Liking heavy (free value, personal connection)

For each approach, sim-director roleplays AS the lead using CRM data, behavioral signals, and ICP baselines to predict how they would respond. The winning strategy is written as a journeyPlan on the outreach record. When the outreach agent runs, it follows the plan automatically.
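
The selection step can be sketched as scoring each persuasion mix against a simulated lead and keeping the winner. The roleplay here is reduced to a dot-product response model with Monte Carlo sampling; the weights, affinity values, and function names are illustrative assumptions:

```python
import random

APPROACHES = {
    "authority_social_proof": {"authority": 0.6, "social_proof": 0.4},
    "scarcity_commitment":    {"scarcity": 0.6, "commitment": 0.4},
    "reciprocity_liking":     {"reciprocity": 0.6, "liking": 0.4},
}

def predicted_reply_prob(weights: dict, lead_affinity: dict) -> float:
    """Stand-in for the roleplay step: how strongly this lead responds
    to each persuasion lever the strategy leans on."""
    return sum(w * lead_affinity.get(k, 0.0) for k, w in weights.items())

def pick_journey_plan(lead_affinity: dict, trials: int = 1000,
                      seed: int = 7) -> str:
    rng = random.Random(seed)
    scores = {}
    for name, weights in APPROACHES.items():
        p = predicted_reply_prob(weights, lead_affinity)
        # Monte Carlo: fraction of sampled roleplays that end in a reply
        scores[name] = sum(rng.random() < p for _ in range(trials)) / trials
    return max(scores, key=scores.get)

# A lead who responds to free value and personal connection
plan = pick_journey_plan({"reciprocity": 0.5, "liking": 0.4, "authority": 0.1})
```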

Red team before send

sim-challenger reviews every pending email before it goes out. It roleplays as three adversarial personas:

  • The Skeptic: "Why should I trust an unknown AI university?"
  • The Competitor's Customer: "I already use Coursera. Convince me to switch."
  • The Budget Blocker: "EUR 49/month is too expensive for something unproven."

Each email gets a vulnerability score. Above 70: a CHAOS alert signal goes to #outreach-review. Above 85: the email is vetoed — it does not send.

Trust verification gate

No simulation output reaches working agents without passing through sim-auditor. Every signal, every strategy, every knowledge triple from the simulation lab is tagged auditStatus: "pending-audit". sim-auditor scores each item 0-100:

  • 80-100: High confidence, verified, passed through to working agents
  • 60-79: Medium confidence, passed with flag
  • 40-59: Low confidence, flagged for human review
  • Below 40: Rejected, never reaches working agents

MiroFish-powered intelligence

We reverse-engineered MiroFish, a swarm intelligence engine with 37,500 GitHub stars that simulates thousands of autonomous agents in parallel digital worlds. We adapted 9 of its patterns for our system.

InsightForge (multi-query decomposition)

Before: an agent asks "Why are German technical leads not converting?" and gets back 3 random knowledge graph triples.

After: the question automatically decomposes into 5-7 sub-queries — "German leads conversion rates", "technical-evaluator segment behavior", "recent email performance Germany", "competitive landscape German market". Each runs against the knowledge graph, RAG store, AND agent memories. Discovered entities are cross-referenced into a network map.

Result: 30+ facts from a single query instead of 3. Agents can now answer complex strategic questions that were impossible with single-shot retrieval.

ICP behavioral baselines

10 ICP segments, each with calibrated behavioral parameters:

| Segment | Activity | Decision Speed | Price Sensitivity | Channel Preference |
| --- | --- | --- | --- | --- |
| Startup Founder | 0.9 | 3 days | Low (0.3) | Email, LinkedIn DM |
| Enterprise CTO | 0.3 | 45 days | Low (0.2) | Email, Referral |
| Career Switcher | 0.7 | 7 days | High (0.8) | Email, Social |
| Agency Owner | 0.6 | 14 days | Medium (0.5) | Email, LinkedIn, Partnership |
| Technical Evaluator | 0.4 | 21 days | Medium (0.4) | Email, GitHub, Technical content |

These baselines calibrate every simulation. When sim-director tests an approach against a "startup founder" proxy, it uses activity level 0.9, 3-day decision cycle, and low price sensitivity. When testing against an "enterprise CTO", it uses activity 0.3, 45-day cycle, and committee-driven decision making.

Parallel multi-channel simulation

Before: sim-director could only test email strategies. After: it simulates email, LinkedIn DM, partnership, referral, content-nurture, webinar, and phone channels simultaneously for the same lead.

Each channel has a calibrated model adjusted by the lead's ICP baseline. For a startup-founder who prefers LinkedIn DM:

  • Email: 25% open, 4% reply, 1.5% conversion
  • LinkedIn DM: 58% open, 10% reply, 3.2% conversion
  • Partnership: 35% open, 12% reply, 4% conversion

The system recommends the best channel and a multi-channel fallback strategy: "Lead with LinkedIn DM. Follow up via partnership if no response after 3 days."
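
The channel ranking can be sketched with the article's example rates for a startup founder. Ranking by expected opened replies (open rate times reply rate) is an illustrative rule that happens to reproduce the recommendation above; the production model presumably blends more signals:

```python
# Per-channel rates from the article's startup-founder example.
CHANNELS = {
    "email":       {"open": 0.25, "reply": 0.04, "convert": 0.015},
    "linkedin_dm": {"open": 0.58, "reply": 0.10, "convert": 0.032},
    "partnership": {"open": 0.35, "reply": 0.12, "convert": 0.040},
}

def rank_channels(channels: dict) -> list[str]:
    """First entry leads the sequence; the rest form the fallback order."""
    return sorted(channels,
                  key=lambda c: channels[c]["open"] * channels[c]["reply"],
                  reverse=True)

plan = rank_channels(CHANNELS)   # lead channel first, fallbacks after
```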

God View scenario injection

Mid-simulation, any sim agent can inject a market disruption — "Coursera drops to EUR 19/month" — and watch customer proxies react in real time. Each proxy responds based on its ICP baseline. A price-sensitive career-switcher (sensitivity 0.8) shows high impact. An enterprise CTO (sensitivity 0.2) barely notices.

No simulation restart needed. Five scenarios tested in the time it used to take for one.

Temporal knowledge graph

Every fact in the knowledge graph is classified as active, historical, or stale. "Coursera pricing is EUR 39/month" might have been true 6 months ago but not today. Agents can query panorama_query to see which facts are fresh and which are outdated. Brain Maintenance automatically flags facts older than 30 days without re-verification.

ReACT report generation

Weekly intelligence reports are no longer template fill-ins. sim-tracker uses ReACT reasoning loops — Thought, Action, Observation — to investigate topics. It plans 2-5 sections, runs 3+ evidence queries per section across the knowledge graph, RAG store, and agent memories, and produces reports with cited evidence.


The knowledge graph

The knowledge graph started as a simple triple store. It is now the shared intelligence layer for all 31 agents.

Architecture

Every fact is stored as a triple: subject → predicate → object. Every triple has:

  • Confidence (0.0-1.0) — how certain the source agent was
  • Temporal fields — validAt, invalidAt, expiredAt — when the fact became true, when it stopped being true
  • Audit status — pending-audit, verified, rejected — whether simulation outputs have been checked
  • Source agent — which agent added this fact
  • Confirmation list — which other agents have independently confirmed it

Domain ontology

10 entity types: Lead, Company, Competitor, Product, Segment, Campaign, Content, Objection, Strategy, Market. 16 relationship types: WORKS_AT, INTERESTED_IN, OBJECTED_WITH, CONVERTED_VIA, COMPETES_WITH, TARGETS_SEGMENT, and more.

Entities are prefixed by type: lead:john@acme.com, company:acme, competitor:coursera. This makes typed queries possible: "Find all Companies where at least one Lead converted."
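
That query can be sketched over a plain in-memory triple list; the prefix convention and the WORKS_AT / CONVERTED_VIA predicates come from the article, while the store and helper are illustrative:

```python
TRIPLES = [
    ("lead:john@acme.com", "WORKS_AT", "company:acme"),
    ("lead:john@acme.com", "CONVERTED_VIA", "campaign:spring-launch"),
    ("lead:mia@globex.com", "WORKS_AT", "company:globex"),
]

def companies_with_conversion(triples) -> set[str]:
    """Find all Companies where at least one Lead converted."""
    converted = {s for s, p, o in triples
                 if p == "CONVERTED_VIA" and s.startswith("lead:")}
    return {o for s, p, o in triples
            if p == "WORKS_AT" and s in converted and o.startswith("company:")}

hits = companies_with_conversion(TRIPLES)
```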

Maintenance

Brain Maintenance runs automatically:

  1. Deduplication — MinHash shingling with Jaccard similarity >= 0.9. "Coursera" and "coursera.com" merge into one entity.
  2. Contradiction resolution — When two facts conflict (same subject + predicate, different objects), the newer fact supersedes the older. The old fact gets an invalidAt timestamp.
  3. Community detection — Label propagation clusters related entities into topic groups. Useful for discovering themes agents have not explicitly connected.
  4. Staleness pruning — Facts older than 30 days without confirmation are flagged for refresh.
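
The merge rule in step 1 can be sketched with exact character-shingle Jaccard similarity and the article's 0.9 threshold. MinHash (an approximation for large sets) is skipped here, and the production pipeline presumably normalizes domains before comparing; this sketch only shows the threshold rule:

```python
def shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping lowercase character k-grams."""
    t = text.lower()
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def should_merge(a: str, b: str, threshold: float = 0.9) -> bool:
    return jaccard(a, b) >= threshold
```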

Cost optimization

Running 31 agents is expensive if you do it naively. We have reduced costs by an estimated 85-95% through five techniques.

Model routing

70% of agent tasks — classification, data extraction, template application — run on Haiku, which delivers roughly twice the throughput at 3-15x lower cost. Complex multi-step reasoning runs on Sonnet. Opus is reserved for the hardest problems.

Deterministic pre-filtering

The blueprint system means the LLM only fires for creative subtasks. Eight of the nine steps in the outreach blueprint execute without an LLM call. Zero tokens are spent deciding what to do next: the code already knows.

Research validates this: Brain.co found that deterministic pipelines beat agentic wandering with 9x the speed, one-fifth the tokens, and 50% higher accuracy.

Prompt caching

Claude's prompt cache has a 5-minute TTL. Cached tokens do not count toward rate limits. With an 80% cache hit rate, the effective token throughput becomes 5x the stated limit. We structure prompts so the system prompt (which rarely changes) sits in the cacheable prefix.
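
The throughput claim is simple arithmetic: if cached tokens do not count against the limit, only the miss fraction is billable. The helper below just formalizes that back-of-envelope calculation:

```python
def effective_throughput_multiplier(cache_hit_rate: float) -> float:
    # Only cache misses count toward the rate limit, so effective
    # throughput scales as 1 / (1 - hit_rate).
    return 1.0 / (1.0 - cache_hit_rate)

multiplier = effective_throughput_multiplier(0.80)   # 80% hit rate -> 5x
```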

Smart scheduling

Agents only run when there is work to do. shouldAgentRun() checks whether new data exists since the last run. If nothing changed, the agent skips. This eliminates 50-90% of wasted runs.

Result truncation

Tool outputs are trimmed to what the agent actually needs. A CRM query that returns 50 fields gets reduced to the 8 relevant ones before entering the context window. Research confirms: "Simple memory — retaining only observations and actions — achieved best performance while minimizing tokens."


What failed and what we learned

Tool paralysis (fixed by blueprints)

Giving agents 35 tools and hoping for the best produced 5% compliance. The fix was not better prompts — it was removing the decision from the LLM entirely. Deterministic orchestration enforces the sequence.

Empty knowledge systems (fixed by deterministic writes)

When knowledge graph writes were optional LLM actions, they never happened. Now they are deterministic steps in the blueprint. Every outreach run writes to the knowledge graph. Every campaign run logs outcomes. Zero tokens spent — it is code, not an LLM call.

Level 0 team autonomy (fixed by protocol injection)

Agents were individually competent but collectively silent. Three root causes:

  1. Blocked simulation gates — sim-director only looked for status="new" leads, but 133 of 133 leads were already contacted. Fixed by removing the status filter.
  2. Nobody used react_to_signal — we told agents about the tool but none called it. Fixed by adding a Team Protocol section to the top of every agent's context.
  3. Talkspace buried in footer — urgent signals appeared at the bottom of the prompt, where LLMs pay the least attention. Fixed by injecting the talkspace digest right after the system state, in the primacy position.

Memory quality crisis (fixed by filtering)

The learning store auto-promoted patterns from episodic memory. Problem: 100% of outreach agent's 44 lessons were trivial patterns like "check_contact tool succeeds 100% of the time." The knowledge graph had 1,465 triples but 1,048 were operational metadata ("agent:outreach called tool:send_email").

Fixed by rejecting patterns above 95% success rate, filtering 30+ trivial infrastructure tools, and only promoting patterns in the 40-95% success range. Result: 247 garbage lessons purged, 1,068 operational triples removed.


OSINT and intelligence pipeline

Every lead that enters the system goes through a multi-stage intelligence pipeline before any agent writes a single email. This is not enrichment — it is reconnaissance.

Parallel enrichment architecture

The moment the system identifies a potential lead, it spawns multiple simultaneous processes. Domain lookup, social media scan, news aggregation, historical financial records — all in the same millisecond. No sequential waterfall. No waiting for one source to finish before the next begins.

The 2Pass research architecture

Pass one is the wide scan. The agent walks around the room to understand the layout — company size, industry, tech stack, basic contact info. If the baseline fits our ICP parameters, it initiates pass two: a hyper-focused deep dive. Behavioral analysis, content sentiment, hiring patterns, competitive positioning. Pass one costs almost nothing. Pass two only fires for qualified targets.

Selector extraction

A selector is a digital breadcrumb — an email structure, a username pattern, a phone number, a custom domain. The agents use JSON-driven rules and regular expressions to automatically extract these selectors from web pages. They are not looking at the visual layout of a website. They communicate directly with the underlying data structures, like reading the architect's blueprint instead of looking at a photograph of the building.

Social fingerprint scoring

When an agent finds the same username on multiple platforms, it does not assume it is the same person. It compares posting times, language patterns, linked repositories, and geographic signals. If a TechForum account and a Reddit account both post about Python algorithms during Central European time and link to similar GitHub repositories, the match score approaches 99%. If the same username appears on a photography blog posting exclusively in Portuguese from a Brazilian IP block, the score plummets.

13 specialized OSINT tools

We built 13 tools for agent-driven intelligence gathering. Ten of them operate at zero marginal cost.

| Tool | What it does |
| --- | --- |
| Enumerate Usernames | Checks 400+ social networks for username matches |
| Detect Integrations | Maps a company's entire marketing and CRM stack from public DNS records |
| GetSiteHistory | Queries Wayback Machine for historical snapshots of target websites |
| CheckInfoStealer | Checks if a prospect's credentials appear in malware-stolen databases |
| CheckMailSecurity | Tests 13 DKIM cryptographic selectors, assigns A-F security grade |
| GetTrafficRank | Pulls Cisco Umbrella and Quantcast rankings for traffic estimation |
| CheckBreaches | Queries HaveIBeenPwned for breach history — reveals which platforms they use |
| FanOutPerson | Master tool: feed it a name and it fires web searches, username enumeration, social profiles, InfoStealer checks, and tech stack detection simultaneously |
| Harvest Contacts | Breadth-first crawl of up to 100 pages per domain, extracts every email, phone, and social link |
| Find Social Profiles | Cross-references Gravatar, EmailRep, and platform-specific APIs |
| Detect Technology | Reads HTTP headers, SSL certificates, and DNS records to map the full tech stack |
| WhoisLookup | Domain registration, expiry, registrant data |
| Fingerprint Tech Stack | SPF records reveal the entire marketing stack — one DNS record shows if they use SendGrid, HubSpot, Zendesk |

Change detection

The agents take a cryptographic hash of a target's web page. If even a single character changes days or weeks later, the hash changes and triggers an alert. The agent then runs differential text analysis to identify exactly what was altered. Three new job listings for back-end data engineers? The tripwire fires and the outreach agent gets a signal.
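
The hash-then-diff tripwire can be sketched with the standard library; `hashlib` and `difflib` are real stdlib modules, while the alert wiring and function names are illustrative:

```python
import difflib
import hashlib

def page_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_change(old_text: str, new_text: str) -> list[str]:
    """Return the added/removed lines, or [] when the hash is unchanged."""
    if page_hash(old_text) == page_hash(new_text):
        return []
    diff = difflib.unified_diff(old_text.splitlines(), new_text.splitlines(),
                                lineterm="", n=0)
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]

changes = detect_change("About us\nCareers: 2 roles",
                        "About us\nCareers: 5 roles\nNew: data engineer")
```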


Communication pipelines in depth

The four communication mechanisms described above — talkspaces, knowledge graph, event bus, episodic memory — are more complex than they first appear.

Talkspace structured signals

Agents do not chat casually. Every message is a structured signal with mathematical urgency levels and confidence scores. Think of it like air traffic control — a pilot cannot radio in and casually say "I'm coming in to land." They give specific coordinates, altitude, approach vector. Nine dedicated channels: #strategy, #outreach-review, #competitive-intel, #content-collab, #agent-learning, #simulation, #market-intel, #signals, and a general channel.

Convergence detection

When agents debate a strategy in a talkspace, a convergence detection algorithm analyzes their confidence scores in real time. It calculates the exact moment the group has reached sufficient consensus — then forces them to agree and get back to work. No endless meetings. No circular debates.
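
One way to sketch convergence detection: end the debate once the spread of the agents' confidence scores falls below a cutoff. The standard-deviation rule and the 0.1 cutoff are illustrative assumptions, not the production algorithm:

```python
import statistics

def has_converged(confidences: list[float], cutoff: float = 0.1) -> bool:
    """True once the group's positions are tight enough to force a decision."""
    return len(confidences) >= 2 and statistics.pstdev(confidences) < cutoff

early = has_converged([0.9, 0.4, 0.6])    # wide spread: keep debating
late = has_converged([0.72, 0.75, 0.78])  # tight spread: force agreement
```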

Contact fatigue model

A Python machine learning model predicts when a specific lead is getting annoyed. It does not just count days since last contact. It analyzes the lead's open rates, time since their last click, and the linguistic sentiment of any previous replies. It calculates a burnout score. If the score exceeds the threshold, the agent backs off — even if the blueprint says it is time for the next email.
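
A burnout score in that spirit can be sketched as a logistic blend of the three inputs. The weights, the 0.75 threshold, and the feature scaling below are illustrative assumptions; the production model is described as learned, not hand-tuned:

```python
import math

def burnout_score(open_rate: float, days_since_click: float,
                  reply_sentiment: float) -> float:
    """0.0 = fully engaged, 1.0 = stop emailing.
    reply_sentiment runs from -1.0 (hostile) to +1.0 (warm)."""
    x = (2.0 * (0.5 - open_rate)       # low opens push the score up
         + 0.05 * days_since_click     # silence accumulates slowly
         - 1.5 * reply_sentiment)      # warm replies pull it down
    return 1.0 / (1.0 + math.exp(-x))  # squash into [0, 1]

def should_back_off(score: float, threshold: float = 0.75) -> bool:
    return score >= threshold

annoyed = burnout_score(open_rate=0.05, days_since_click=30, reply_sentiment=-0.8)
engaged = burnout_score(open_rate=0.60, days_since_click=2, reply_sentiment=0.5)
```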

Grounding protocol

When an agent communicates, it must pull strictly from its own domain data. The prompts force a specific output structure: "Based on my domain data, I recommend X because of data point Y." If the underlying scripts cannot verify the data source in the agent's payload, the recommendation is blocked from being posted. No hallucinated intelligence enters the talkspace.

Interagent task delegation

If the growth analyst needs competitor pricing, it calls request_agent_task targeting the competitor watch agent with a specific JSON payload: action required, get current pricing for competitor X, urgency high. That request is logged on the event bus and acts as an interrupt — the competitor watch agent picks it up on its next run.

Control group (scientific method)

The system automatically holds out a 10% control group. These leads intentionally bypass the standard AI outreach cadence so the system can measure the true baseline effectiveness of its campaigns. The agents are autonomously applying the scientific method to their own sales pipeline.


The knowledge vault

The knowledge system is not a black box vector database. It is built on human-readable, Obsidian-compatible markdown files.

Why plaintext

Most AI memory systems use vector databases — ChromaDB, Pinecone, Weaviate. The problem: you cannot open a vector embedding and read it. You cannot diff it. You cannot grep it. When something goes wrong, you are debugging a black box.

We chose the OpenClaw pattern: markdown files paired with a lightweight SQLite index. Every memory is a .md file you can open in Obsidian, VS Code, or a text editor. The SQLite index enables hybrid search — BM25 keyword matching combined with graph BFS traversal. The system scans 10,000 chunks in under 100 milliseconds.

Relationship mapping via frontmatter

Plaintext does not natively support relationships. The solution is YAML front matter at the top of each file that explicitly declares structured relationships: relationships: Alice → works-in → Project Beta. The parser constructs a local graph in memory on the fly. The relationships are human-readable in the file AND machine-queryable through the index.

Memory composition scoring

When an agent queries its memory, the system calculates a composite score:

  • 50% semantic relevance — how closely the memory matches the query
  • 30% recency — exponential decay with a 30-day half-life
  • 20% importance — assigned when the fact was first extracted

A memory that is not flagged as highly important and has not been accessed recently will naturally degrade in retrieval priority. The system forgets what does not matter.
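
The composite score above reduces to one formula. The 50/30/20 split and the 30-day half-life are the article's numbers; the helper itself is an illustrative sketch:

```python
import math

def memory_score(semantic: float, age_days: float, importance: float) -> float:
    """semantic and importance in [0, 1]; recency decays exponentially
    with a 30-day half-life (1.0 today, 0.5 at 30 days, 0.25 at 60)."""
    recency = math.exp(-math.log(2) * age_days / 30.0)
    return 0.5 * semantic + 0.3 * recency + 0.2 * importance

fresh = memory_score(semantic=0.8, age_days=0, importance=0.4)
faded = memory_score(semantic=0.8, age_days=60, importance=0.4)
```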

Incremental processing

The system monitors file hashes and only reprocesses markdown files that have been modified or added since the last compute cycle. It never re-indexes the entire vault. This keeps maintenance cost near zero even as the knowledge base grows.


The CEO dashboard

Running 31 agents is meaningless if you cannot see what they are doing, intervene when they are wrong, and direct their work.

Admin war room

The admin dashboard shows every agent's health, last run time, daily budget usage, recent actions, and error count. You can enable or disable any agent with one click. You can see the full talkspace feed, filter by channel, and read every signal and reaction.

The dashboard includes:

  • Agent roster — health status, run history, action counts for all 31 agents
  • Talkspace feed — real-time signal flow across all 9 channels
  • Knowledge graph explorer — browse entities, triples, communities
  • CRM view — leads, contacts, outreach records, conversion pipeline
  • Market intelligence — prediction market feeds, competitive intel, trend radar
  • Simulation space — active simulations, strategy proposals, vote results, trust scores

WhatsApp command interface

A custom WhatsApp integration built on the Baileys library gives the CEO a direct encrypted line to the agent system. No monthly fees, no API limits: a persistent WebSocket connection running 24/7.

Commands include:

  • Text agent — receive a numbered list of all agents currently online
  • Text marketing — get a synthesized briefing from the marketing strategist
  • Send a voice note — Whisper transcribes it, the system parses intent, and translates it into commands
  • Text run sim-director — trigger a specific agent run on demand

Voice meetings with agents

Using 11Labs voice synthesis, each agent has a distinct human voice. You can have a morning stand-up where your VP of sales debates your marketing strategist while a background process extracts structured action items and capability gaps from the transcript.

Commitments made during meetings are tracked automatically. When you give verbal approval for a direction, the system logs it as a hard commitment and distributes the workload across the event bus.

5-minute heartbeat cycle

The entire system is governed by a strict 5-minute heartbeat. Every 300 seconds, each agent checks in. If an agent misses its heartbeat, it is flagged as degraded. If it misses three consecutive heartbeats, it is marked as errored and the owner is notified. No agent runs indefinitely. No agent goes rogue.


What the agents actually said: real talkspace logs

The most interesting thing about autonomous agents is not what they are designed to do — it is what they discover on their own.

The tutorial trap

The competitor gap agent analyzed every major competitor — Coursera, Skool communities, AI.nl — and arrived at a conclusion no human had explicitly programmed: all competitors teach people about AI, but none of them actually provide a platform to deploy AI agents that run a business. The agent called this "the tutorial trap" and built an entire competitive positioning strategy around it.

It validated the thesis by scanning GitHub. It found zero repositories with 100+ stars for no-code AI agent business operator deployments. DeFi has 131,000 stars. But the agent correctly recognized that DeFi requires complex technical setup — it is not the same market.

The Louis Mala debate

The marketing agent proposed a scarcity-heavy outreach strategy for a C-suite lead. sim-challenger red-teamed it, identified that scarcity tactics do not work on experienced executives who receive 50 cold emails a day, and issued a chaos veto with an 88/100 vulnerability score. The email was blocked from sending.

The system then pivoted to a reciprocity-first approach — offering genuine value upfront instead of manufacturing urgency. This is the kind of strategic correction that normally requires a senior sales manager reviewing every email.

The 0% open rate crisis

93 emails sent to real prospects. Open rate: 0%. Reply rate: 0%. This triggered 72 distinct escalation alerts and spawned six separate remediation tasks across three agents. The system did not ignore the failure. It treated it as a crisis, diagnosed likely causes (deliverability, subject lines, send timing), and generated specific fixes — all without human intervention.

Cold start mode

With near-zero real subscribers, the sales pipeline agents are effectively shadowboxing. They process hypothetical leads, prepare templates, score personas, and run simulations. They processed one real lead during the observation period, and the system scored it too low to pursue. The agents are Olympic athletes sitting in a locker room, warming up. When the door opens, they explode off the starting line.


The road ahead: 2027 and beyond

Scaling to 300 agents

Now that token costs have been reduced by 60-75%, the compute bottleneck for the current 31 agents is largely solved. The architecture supports horizontal scaling. 300 agents could create entirely new hyper-personalized curriculums, grade complex assignments in real time, and provide one-on-one AI mentorship for every student simultaneously.

Agent self-optimization

With freed-up compute cycles, the agents have the structural capability to design, test, and train their own system optimizations. sim-inventor already creates new skill scripts to fix failure patterns. The next step is agents optimizing their own blueprints — adjusting turn budgets, refining tool selections, and pruning inefficient steps without human prompting.

AI-to-AI business operations

When this virtual company needs to negotiate a B2B deal with another company that also runs an agent system, what happens? Do their event buses talk to each other? Do their pipelines temporarily merge to negotiate a contract? Entire virtual companies negotiating and acquiring each other at the speed of code — parsing entity triples and running Monte Carlo simulations against each other.


Key takeaways

  • 31 agents, 4 groups: 15 working, 10 simulation, 5 ad ops, 1 market intelligence. Each with a daily action budget enforced by the registry.
  • Blueprints over prompts: Deterministic orchestration solved the tool compliance problem that better prompts could not. The LLM does creative work. The code enforces the sequence.
  • Simulation before action: Every high-value lead gets 3 pre-computed strategies, adversarial red-teaming, and a trust verification gate before any email sends.
  • Knowledge graph is the shared brain: 1,100+ triples with temporal validity, automated deduplication, contradiction resolution, and community detection.
  • OSINT pipeline runs reconnaissance, not enrichment: 13 specialized tools, parallel 2Pass architecture, social fingerprint scoring, change detection — all before the first email.
  • Communication is structured: Talkspaces with scored signals, convergence detection, grounding protocol, fatigue modeling, and a 10% control group for scientific measurement.
  • The knowledge vault is human-readable: Obsidian-compatible markdown files with SQLite hybrid search, composite memory scoring, and incremental processing.
  • The CEO runs it like an OS: Admin dashboard, WhatsApp command interface, voice meetings with 11Labs synthesis, 5-minute heartbeat cycle.
  • The agents discover things on their own: The tutorial trap, the GitHub operator gap, adversarial email vetoes, the 0% crisis remediation — all emergent behavior, not programmed.
  • Cost is controllable: Model routing, deterministic pre-filtering, prompt caching, smart scheduling, and result truncation combine to reduce costs by 85-95%.
  • Failures are the curriculum: Tool paralysis, empty knowledge systems, zero team autonomy, and garbage memories all happened. Each failure produced a specific architectural fix that made the system stronger.
  • The road ahead: 300-agent scaling, self-optimizing blueprints, and AI-to-AI business negotiation are architecturally possible with the current system.