Building Claude Code Skills
LLMs are bad at math and deterministic computation. Skills solve this by wrapping Python scripts that agents can call. Each skill has a SKILL.md definition and pure Python scripts that take JSON in and return JSON out.
Why skills exist
LLMs are terrible at deterministic computation. Ask Claude to score a lead based on 9 weighted features with specific thresholds and a sigmoid normalization curve, and it will get it wrong. Not because it's stupid -- because floating-point math, consistent rule application, and repeatable numerical outputs aren't what neural networks are designed for. They predict the next token. Sometimes that token happens to be the right number. Often it doesn't.
Try this yourself: give Claude a table of 50 leads with visit counts, pricing page views, time-on-site, referrer source, and device type. Ask it to score each one using a specific weighted formula. You'll get inconsistent results. It'll round differently across rows. It'll forget to apply a cap. It'll silently change a weight halfway through. This isn't a bug -- it's a fundamental architectural mismatch between what you're asking and what the model does.
Skills solve this by separating what LLMs are good at from what they're bad at:
- LLM handles reasoning: deciding when to score a lead, which data to pass in, and what to do with the result
- Python handles computation: the actual math, the weights, the thresholds, the normalization
The agent calls a Python script via an MCP tool. JSON in, JSON out. The script is deterministic -- same input always produces the same output. The agent never touches the math.
This is the fundamental pattern: LLM for reasoning, scripts for computation.
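The contract is small enough to show whole. Here's a hypothetical minimal skill script -- `add_numbers.py` is not one of the real skills, just the pattern at its smallest:

```python
#!/usr/bin/env python3
"""Minimal skill script: JSON string in via argv[1], JSON out via stdout."""
import json
import sys

def add_numbers(args: dict) -> dict:
    # The deterministic computation lives here, not in the LLM.
    return {"total": sum(args.get("numbers", []))}

if __name__ == "__main__":
    args = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
    print(json.dumps(add_numbers(args)))
```

Running `python3 add_numbers.py '{"numbers": [2, 3, 5]}'` prints `{"total": 10}` -- the same answer every time, no matter which model asked.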
The skill pattern
Every skill follows the same directory structure:
.claude/skills/
  lead-scoring-engine/
    SKILL.md          # Definition: what it does, inputs, outputs
    scripts/
      score_lead.py   # Pure function: JSON in -> JSON out
Two files. That's it.
SKILL.md tells the agent what the skill does and how to use it. It's the interface documentation -- the contract between the agent and the script. When an agent reads your SKILL.md, it should know exactly what data to pass and what it'll get back.
scripts/*.py are the actual implementations. The rules are strict:
- Pure Python stdlib only -- no pip dependencies. `json`, `sys`, `math`, `datetime`, `collections`, `re` -- all fair game. No `numpy`, no `pandas`, no `requests`.
- Accept a JSON string as the first CLI argument -- the agent passes data as a JSON string via `sys.argv[1]`.
- Print JSON to stdout -- the result goes back to the agent as a JSON object.
- Exit 0 on success, non-zero on error -- standard Unix conventions.
- No side effects -- pure functions. Don't write files, don't call APIs, don't modify state. (The few exceptions, like `data-pruner`, are clearly documented.)
Why stdlib only? Because skills need to run anywhere. No virtualenv setup, no dependency hell. If Python 3 is installed, your skill works. Period.
Anatomy of a SKILL.md
Here's the real SKILL.md from the lead-scoring-engine -- the most-used skill in the system:
---
name: lead-scoring-engine
description: Score leads numerically based on weighted behavioral signals.
Use this whenever you need to prioritize visitors for outreach or assess
lead quality. Auto-triggers when processing visitor lists.
user-invocable: false
tags: [outreach, scoring, computation]
---
# Lead Scoring Engine
Scores visitors 0-100 based on weighted behavioral features. Returns
score, tier, and top contributing factors.
## Usage
```bash
run_skill_script skill="lead-scoring-engine" script="score_lead.py" args='{"visits": 5, "pricingVisits": 2, "timeOnSite": 180, "pagesViewed": 8, "returningVisitor": true, "referrer": "google", "device": "desktop"}'
```
## Input Fields
- `visits` -- total page visits
- `pricingVisits` -- visits to pricing page
- `timeOnSite` -- seconds on site
- `pagesViewed` -- unique pages viewed
- `returningVisitor` -- boolean
- `referrer` -- traffic source (google, linkedin, reddit, direct, etc.)
- `device` -- desktop, mobile, tablet
- `checkoutVisits` -- visits to checkout page
- `blogPosts` -- blog posts read
## Output
Returns JSON: `{score, tier, factors, recommendation}`
- Tiers: hot (75+), warm (50-74), cool (25-49), cold (0-24)
Let's break down each part.
Frontmatter gives the system metadata. name matches the directory name. description is critical -- this is what the agent reads to decide whether to use the skill. Make it specific. "Score leads" is too vague. "Score leads numerically based on weighted behavioral signals" tells the agent exactly what this does and when to reach for it. The tags help with discovery when agents search for skills. user-invocable: false means this skill is only called by agents, not directly by humans.
Usage section shows the exact invocation syntax. This is the most important section. Agents are pattern matchers -- they'll copy this structure and swap in their own data. If your usage example is wrong or incomplete, agents will call the skill incorrectly.
Input Fields document every parameter. Include the type, the unit (seconds, not minutes), and valid values. Don't make agents guess.
Output describes the return shape. Agents need to know what fields they'll get back so they can use the result in their reasoning. The tier definitions (hot = 75+, warm = 50-74) let the agent make decisions without parsing the score itself.
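Those tier boundaries are plain threshold checks on the script side. A sketch of the mapping (the real `score_lead.py` may implement it differently):

```python
def tier_for(score: float) -> str:
    """Map a 0-100 lead score to the tiers documented in SKILL.md."""
    if score >= 75:
        return "hot"
    if score >= 50:
        return "warm"
    if score >= 25:
        return "cool"
    return "cold"
```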
How agents invoke skills
The run_skill_script MCP tool bridges agents and Python scripts. Here's the actual tool definition from skills-memory.ts:
{
name: "run_skill_script",
description: "Execute a Python script from a skill directory. Scripts are pure functions: JSON in -> JSON out.",
input_schema: {
type: "object",
properties: {
skill: { type: "string", description: "Skill directory name (e.g. 'lead-scoring-engine')" },
script: { type: "string", description: "Script filename (e.g. 'score_lead.py')" },
args: { type: "string", description: "JSON string to pass as argument to the script" },
timeout: { type: "number", description: "Timeout in seconds (default 30, max 120)" },
},
required: ["skill", "script"],
},
}
Four parameters. skill is the directory name. script is the filename inside scripts/. args is the JSON payload. timeout is optional -- defaults to 30 seconds, maxes out at 120.
Here's what happens when an agent calls this tool:
const projectRoot = path.resolve(__dirname, "../../../..");
const scriptPath = path.join(
projectRoot, ".claude", "skills", skillName, "scripts", scriptName
);
// Use skill-local venv if available, otherwise system Python
const venvPython = path.join(
projectRoot, ".claude", "skills", skillName, "venv", "bin", "python3"
);
const pythonCmd = existsSync(venvPython) ? venvPython : "python3";
const output = execSync(
`"${pythonCmd}" "${scriptPath}" '${scriptArgs}'`,
{ timeout, cwd: projectRoot, encoding: "utf-8" }
);
return output.trim(); // JSON result goes back to the agent
The execution flow:
- Agent decides it needs to score a lead (reasoning)
- Agent calls `run_skill_script` with `skill="lead-scoring-engine"`, `script="score_lead.py"`, and the visitor data as JSON
- MCP server locates the Python script at `.claude/skills/lead-scoring-engine/scripts/score_lead.py`
- Server checks for a skill-local virtualenv first (for rare skills that need dependencies), falls back to system Python
- Script runs with the JSON args as `sys.argv[1]`
- Script prints JSON to stdout
- MCP server captures stdout and returns it to the agent
- Agent uses the result in its next reasoning step -- e.g., "Score is 82 (hot tier), so I'll draft a personalized outreach email with a pricing link"
The agent never does the math. It just decides when to ask for a score and what to do with the answer.
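For local experimentation outside the MCP server, the same flow is easy to reproduce in Python. This is a stand-in sketch of the TypeScript logic above, not part of the real system:

```python
import json
import subprocess
from pathlib import Path

def run_skill_script(skill: str, script: str, args: dict, timeout: int = 30) -> dict:
    """Resolve a skill script, run it with a JSON argument, parse its JSON output."""
    script_path = Path(".claude/skills") / skill / "scripts" / script
    # Prefer a skill-local virtualenv if one exists, else system Python.
    venv_python = Path(".claude/skills") / skill / "venv" / "bin" / "python3"
    python_cmd = str(venv_python) if venv_python.exists() else "python3"
    try:
        proc = subprocess.run(
            [python_cmd, str(script_path), json.dumps(args)],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
        return json.loads(proc.stdout)
    except subprocess.CalledProcessError as e:
        return {"error": f"Script failed: {e.stderr.strip() or 'unknown error'}"}
    except subprocess.TimeoutExpired:
        return {"error": f"Script timed out after {timeout}s"}
```

One design note: passing the JSON argument as a list element (rather than interpolating it into a shell string, as the `execSync` call does) sidesteps shell-quoting issues entirely.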
One important detail: error handling. If the script fails (bad input, timeout, exception), the MCP server catches the error and returns a JSON error object:
catch (err) {
  return JSON.stringify({
    error: `Script failed: ${err.stderr || err.message || "unknown error"}`
  });
}
The agent gets back an error message it can reason about -- retry with different data, skip the lead, or ask for help.
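The script side can offer the same guarantee: wrap the entry point so even an unexpected exception comes back as a JSON error object instead of a bare traceback. A sketch (the real skills may structure this differently):

```python
import json
import sys

def run(args: dict) -> dict:
    # Stand-in for the skill's real computation; raises KeyError on bad input.
    return {"echo": args["value"]}

def main(argv: list) -> int:
    try:
        args = json.loads(argv[1]) if len(argv) > 1 else {}
        print(json.dumps(run(args)))
        return 0
    except Exception as e:
        # Any failure still produces parseable JSON the agent can reason about.
        print(json.dumps({"error": f"Script failed: {e}"}))
        return 1

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```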
The complete skill catalog
Here are all the production skills, grouped by function. Each one follows the same SKILL.md + Python script pattern.
Lead Intelligence
| Skill | What it does |
|---|---|
| `lead-scoring-engine` | Scores visitors 0-100 on weighted behavioral signals (pricing visits, time on site, referrer, etc.) |
| `visitor-behavior-analysis` | Classifies browsing patterns into intent types: careful_reader, price_checker, feature_explorer, comparison_shopper |
| `reply-classification` | Classifies email replies by intent (interested, question, objection, unsubscribe) and recommends response strategy |
Predictive Analytics
| Skill | What it does |
|---|---|
| `churn-prediction` | Predicts churn probability from engagement signals. Includes a trainable model -- feed it labeled outcomes to improve |
| `send-timing-optimizer` | Calculates optimal email send time based on timezone, historical open patterns, and recipient segment |
| `cohort-analysis` | Groups users by signup period, computes retention matrices, identifies critical drop-off points |
Content & Campaign
| Skill | What it does |
|---|---|
| `email-ab-testing` | Chi-squared significance testing for A/B email experiments. Tells you if a winner is real or noise |
| `email-ab-optimizer` | Analyzes campaign data to recommend optimal subject patterns, tones, send windows, and angles |
| `content-performance-analyzer` | Scores content by engagement metrics, identifies winning topics/formats, computes decay rates |
| `campaign-performance-tracker` | Aggregates per-campaign metrics, ranks campaigns, identifies what's failing and why |
| `ad-performance-optimizer` | ROAS calculation, diminishing returns detection, budget reallocation across ad campaigns |
Competitive Intelligence
| Skill | What it does |
|---|---|
| `trend-detection` | Time-series momentum analysis -- detects rising/declining topics and their lifecycle stage (emerging, peaking, dead) |
| `sentiment-tracker` | Tracks sentiment ratios across sources over time, detects shifts in how people talk about a topic or brand |
| `competitor-page-diff` | Compares two versions of a competitor page to detect pricing, feature, and messaging changes |
| `partner-fit-scorer` | Scores partnership candidates on audience overlap, complementary value, reach, and engagement |
System Maintenance
| Skill | What it does |
|---|---|
| `cross-agent-intelligence` | Synthesizes data across all agent stores into cross-cutting strategic insights no single agent can see |
| `data-pruner` | Scans data stores and archives/deletes stale records based on configurable TTL rules |
| `memory-consolidator` | Deduplicates agent memory entries using Jaccard similarity, promotes/demotes based on age and usage |
| `feedback-hygiene` | Applies EWMA time-weighting to feedback data, removes noise strategies with too few samples |
| `intelligence-freshness` | Checks timestamps on competitive intel and flags stale items past their freshness TTL |
| `adaptive-feedback-loop` | Records action outcomes and updates strategy weights -- the self-improving feedback system |
Research & Prospecting
| Skill | What it does |
|---|---|
| `company-deep-research` | Deep-researches a company by domain -- scrapes careers pages, detects pain signals, estimates team size |
| `linkedin-prospector` | Browser automation for LinkedIn discovery -- search, view profiles, extract data, guess emails (uses local venv) |
That's 23 skills total. Most are pure stdlib Python. Two exceptions: linkedin-prospector uses a local virtualenv with nodriver for browser automation, and company-deep-research uses stdlib urllib for basic HTTP.
Tutorial: Write your first skill
Let's build an "email-subject-scorer" skill from scratch. It'll score email subject lines for predicted open rate effectiveness based on length, power words, personalization signals, and spam triggers.
Step 1: Create the directory structure
mkdir -p .claude/skills/email-subject-scorer/scripts
Step 2: Write the SKILL.md
---
name: email-subject-scorer
description: Score email subject lines for effectiveness. Evaluates length,
power words, personalization, and spam triggers. Use this before sending
any outreach email to optimize the subject line.
user-invocable: false
tags: [email, scoring, outreach]
---
# Email Subject Scorer
Scores subject lines 0-100 for predicted open rate effectiveness.
## Usage
```bash
run_skill_script skill="email-subject-scorer" script="score_subject.py" args='{"subject": "Quick question about your growth strategy?"}'
```
## Input
- `subject` -- the email subject line to score
## Output
Returns JSON: `{score, factors, suggestions}`
Notice the description says when to use the skill ("before sending any outreach email") -- not just what it does. This is what makes agents reach for it at the right moment.
Step 3: Write the Python script
#!/usr/bin/env python3
"""Score email subject lines for effectiveness."""
import sys
import json
POWER_WORDS = {
"free", "new", "proven", "secret", "instant",
"exclusive", "limited", "urgent", "discover", "unlock"
}
SPAM_TRIGGERS = {
"buy now", "act now", "limited time", "click here",
"100%", "guarantee", "no obligation", "risk free"
}
def score_subject(subject: str) -> dict:
score = 50 # baseline
factors = []
suggestions = []
# Length scoring (ideal: 30-50 chars)
length = len(subject)
if 30 <= length <= 50:
score += 15
factors.append("Optimal length")
elif length < 20:
score -= 10
factors.append("Too short")
suggestions.append("Aim for 30-50 characters")
elif length > 60:
score -= 10
factors.append("Too long -- may get truncated on mobile")
suggestions.append("Shorten to under 50 characters")
# Power words
words = set(subject.lower().split())
power_matches = words & POWER_WORDS
if power_matches:
score += len(power_matches) * 5
factors.append(f"Power words: {', '.join(sorted(power_matches))}")
# Personalization signals
if any(token in subject.lower() for token in ["you", "your"]):
score += 10
factors.append("Personalized (uses 'you/your')")
else:
suggestions.append("Add 'you' or 'your' for personalization")
# Question format (drives curiosity)
if subject.strip().endswith("?"):
score += 8
factors.append("Question format (higher open rate)")
# Number in subject (specificity signal)
if any(char.isdigit() for char in subject):
score += 5
factors.append("Contains number (specificity)")
# Spam trigger check
lower = subject.lower()
for trigger in SPAM_TRIGGERS:
if trigger in lower:
score -= 15
factors.append(f"Spam trigger: '{trigger}'")
# ALL CAPS check
words_list = subject.split()
caps_words = [w for w in words_list if w.isupper() and len(w) > 1]
if caps_words:
score -= 5 * len(caps_words)
factors.append("ALL CAPS words detected")
suggestions.append("Avoid ALL CAPS -- it triggers spam filters")
score = max(0, min(100, score))
return {"score": score, "factors": factors, "suggestions": suggestions}
if __name__ == "__main__":
args = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
result = score_subject(args.get("subject", ""))
print(json.dumps(result))
Every design choice here matters:
- Baseline of 50: a boring-but-fine subject line scores average. Points are added or subtracted from there.
- Capped at 0-100: `max(0, min(100, score))` keeps the score inside the documented range no matter how many bonuses or penalties stack up.
- Factors list: the agent can see why the score is what it is, not just the number.
- Suggestions list: actionable -- the agent can rewrite the subject and re-score.
Step 4: Test it locally
python3 .claude/skills/email-subject-scorer/scripts/score_subject.py \
  '{"subject": "Quick question about your growth strategy?"}'
Expected output:
{
"score": 83,
"factors": [
"Optimal length",
"Personalized (uses 'you/your')",
"Question format (higher open rate)"
],
"suggestions": []
}
Test edge cases too:
# Too short
python3 .claude/skills/email-subject-scorer/scripts/score_subject.py \
'{"subject": "Hi"}'
# Spam trigger
python3 .claude/skills/email-subject-scorer/scripts/score_subject.py \
'{"subject": "BUY NOW -- Limited Time 100% Guarantee!!!"}'
The spam subject should score near 0. If it doesn't, your penalties aren't aggressive enough.
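Why "near 0"? Walking the penalties by hand makes the expectation concrete -- this recomputes `score_subject`'s rules for the spam subject, step by step:

```python
# "BUY NOW -- Limited Time 100% Guarantee!!!" through score_subject's rules:
score = 50           # baseline
score += 15          # 41 characters: in the 30-50 "optimal length" band
score += 5           # one power word: "limited"
score += 5           # contains a digit ("100")
score -= 15 * 4      # four spam triggers: "buy now", "limited time", "100%", "guarantee"
score -= 5 * 2       # two ALL CAPS words: "BUY", "NOW"
score = max(0, min(100, score))
assert score == 5    # near zero, as intended
```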
Step 5: Deploy it
There's no deploy step. Once the files exist in .claude/skills/email-subject-scorer/, any agent with access to the run_skill_script MCP tool can use it. The agent reads the SKILL.md to understand the interface, then calls the tool with the right parameters.
Here's what it looks like from the agent's perspective:
- Agent is drafting an outreach email
- Agent reads SKILL.md for `email-subject-scorer` and sees "Use this before sending any outreach email"
- Agent calls: `run_skill_script skill="email-subject-scorer" script="score_subject.py" args='{"subject": "Quick question about your growth strategy?"}'`
- Gets back: `{"score": 83, "factors": [...], "suggestions": []}`
- Agent reasons: "Score is 83, no suggestions -- subject line is good. Proceeding with send."
If the score had been 45 with suggestions, the agent would rewrite the subject line and score it again. The skill creates a feedback loop the agent can use autonomously.
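Sketched as a control loop (the `rewrite_fn` here is hypothetical -- in the real system the agent itself does the rewriting and re-invokes the skill):

```python
def improve_subject(subject, score_fn, rewrite_fn, target=75, max_rounds=3):
    """Score, rewrite on suggestions, re-score -- until good enough or out of tries."""
    for _ in range(max_rounds):
        result = score_fn(subject)
        if result["score"] >= target and not result["suggestions"]:
            return subject, result
        subject = rewrite_fn(subject, result["suggestions"])
    return subject, score_fn(subject)
```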
Skill design principles
After building 23 skills, these are the principles that matter.
Pure functions
No side effects. Same input always produces the same output. This isn't a nice-to-have -- it's what makes skills trustworthy. If your lead scoring script returns 72 today and 68 tomorrow for identical input, you've got a bug, not a feature. The whole point is that agents can rely on deterministic results.
The exception is skills that intentionally modify state (like `data-pruner` or `memory-consolidator`). These are clearly documented as such, and they all support a `dry_run` flag for safe testing.
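For those state-modifying skills, the dry_run convention is a thin guard around the one mutation. A self-contained sketch of the idea (not the actual data-pruner code):

```python
from datetime import datetime, timedelta

def prune(records: list, ttl_days: int = 30, dry_run: bool = True) -> dict:
    """TTL-based pruning pass with a dry_run guard: report, but only mutate when asked."""
    cutoff = datetime.now() - timedelta(days=ttl_days)
    stale = [r for r in records if datetime.fromisoformat(r["updated"]) < cutoff]
    if not dry_run:
        for r in stale:
            records.remove(r)  # the only mutation, gated behind dry_run=False
    return {"dry_run": dry_run, "stale": len(stale), "remaining": len(records)}
```

Defaulting `dry_run` to True means a careless call reports what would happen instead of deleting anything.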
Stdlib only
No pip dependencies. No requirements.txt. No virtualenv. If Python 3 is installed on the machine, the skill works. This constraint forces you to write lean, portable code.
You might think "but I need numpy for this calculation." You probably don't. The churn-prediction skill implements logistic regression in 30 lines of stdlib Python using math.exp. The email-ab-testing skill does chi-squared significance testing with math and collections. If you absolutely need an external library (like linkedin-prospector needs nodriver for browser automation), use a skill-local virtualenv -- the MCP tool checks for one automatically.
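As a taste of how far stdlib math goes, here's sigmoid-based scoring at the core of logistic regression -- a sketch of the idea, not the actual churn-prediction code:

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into a 0-1 probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_churn(features: dict, weights: dict, bias: float = 0.0) -> float:
    """Weighted feature sum through a sigmoid -- logistic regression inference, stdlib only."""
    z = bias + sum(weights[name] * features.get(name, 0.0) for name in weights)
    return sigmoid(z)
```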
JSON interface
Always accept a JSON string as sys.argv[1]. Always print JSON to stdout. No other formats. No CSV. No YAML. No plain text. JSON is the universal agent interchange format -- every MCP tool speaks it, every agent can parse it.
if __name__ == "__main__":
args = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
result = do_the_thing(args)
print(json.dumps(result))
This boilerplate pattern should be at the bottom of every skill script.
Fail loudly
Bad input should produce a clear error, not a silent wrong answer. If someone passes a string where you expect a number, don't silently convert it -- raise an error that tells the agent what went wrong.
if "subject" not in args or not isinstance(args["subject"], str):
print(json.dumps({"error": "Missing required field: subject (string)"}))
sys.exit(1)
The agent gets back an error message it can reason about. Maybe it passed the wrong field name. Maybe it forgot a required parameter. Either way, the error message should tell it exactly how to fix the call.
Document the contract
SKILL.md is the interface. If it's unclear, agents will misuse the skill. Every SKILL.md needs:
- A specific description that tells agents when to use the skill, not just what it does
- A complete usage example with realistic data that agents can pattern-match against
- Every input field documented with its type, unit, and valid values
- The output shape so agents know what fields they'll get back
Think of SKILL.md as an API doc for a reader who's smart but has never seen your code. That reader is the agent. It can reason about your documentation, but it can't guess what you left out.