Multi-agent systems — Quick Intro

INTRO · BLOCK · 01

MULT · 7 MIN PREVIEW

When one agent isn't enough.

The MAST study (NeurIPS 2025) measured a 17.2× error amplification when teams just throw N LLMs at a problem. Multi-agent done right cuts cost and improves accuracy. Multi-agent done wrong is the most expensive way to build a chatbot. This trailer shows the difference.

CONCEPT · BLOCK · 02

The one-line difference

A single agent is one model in a tool-calling loop. A multi-agent system is N specialised agents plus a coordination contract. The contract — router, supervisor, pipeline, swarm, debate, blackboard, or graph — is what makes it a system instead of a chat group. If the only thing connecting your 'agents' is shared chat history, you don't have a multi-agent system. You have a chatbot pretending to be a team.

TIP: Always start single-agent. Promote to multi-agent only when the measured eval lift exceeds the added cost, not because it's trendy.

WATCH OUT: Per the MAST taxonomy, an unstructured 'bag of agents' amplifies errors 17.2× vs. single-agent. Pick a pattern before writing code.

GOTCHA: An agent without a recursion cap, calling another agent without a recursion cap, will bankrupt you. The first guardrail you write is depth.
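
A depth guardrail can be as small as a counter that travels with every delegation. The sketch below is one possible shape, not a prescribed implementation: `MAX_DEPTH`, `CallContext`, `delegate`, and the two toy agents are hypothetical names invented for illustration.

```python
from dataclasses import dataclass

MAX_DEPTH = 3  # hypothetical cap; tune per system


class DepthExceeded(RuntimeError):
    """Raised when an agent call chain goes deeper than MAX_DEPTH."""


@dataclass(frozen=True)
class CallContext:
    depth: int = 0

    def child(self) -> "CallContext":
        # Each delegation produces a context one level deeper,
        # and hard-fails instead of silently continuing past the cap.
        if self.depth + 1 > MAX_DEPTH:
            raise DepthExceeded(f"depth cap {MAX_DEPTH} exceeded")
        return CallContext(self.depth + 1)


def delegate(agent, task: str, ctx: CallContext) -> str:
    """Call another agent, passing along a context one level deeper."""
    return agent(task, ctx.child())


def looping_agent(task: str, ctx: CallContext) -> str:
    # A deliberately buggy agent that always delegates to itself.
    return delegate(looping_agent, task, ctx)


def leaf_agent(task: str, ctx: CallContext) -> str:
    return f"done at depth {ctx.depth}: {task}"


print(delegate(leaf_agent, "summarise", CallContext()))
try:
    delegate(looping_agent, "summarise", CallContext())
except DepthExceeded as exc:
    print("stopped:", exc)
```

Because the context is immutable and passed explicitly, an agent cannot reset its own depth; the only way to call deeper is through `child()`, which enforces the cap.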

DIAGRAM · BLOCK · 03

Router → 3 specialists

Router picks ONE specialist per turn — never broadcast. That's the cheapest, most predictable starting pattern.

CODE · BLOCK · 04

A 12-line router with two specialists

PYTHON

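from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def route(question: str) -> str:
    """Ask a small model to label the question as 'code' or 'research'."""
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap classifier
        messages=[
            {"role": "system", "content": "Reply 'code' or 'research' only."},
            {"role": "user", "content": question},
        ],
    )
    label = (r.choices[0].message.content or "").lower()
    # Anything that is not clearly 'code' falls back to 'research'.
    return "code" if "code" in label else "research"


def run(q, agents):
    return agents[route(q)](q)  # dispatch is a one-line dict lookup


# code_agent and research_agent are your specialist callables (str -> str),
# defined elsewhere; they are not shown in this snippet.
agents = {"code": code_agent, "research": research_agent}
print(run("fix the off-by-one in main.py", agents))

In route(), classification runs on a SMALL model. In run(), dispatch is a one-line lookup. The savings come from the cheap classifier, not from running more agents.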
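
The dispatch step can be checked without any API call if the classifier is passed in as a parameter. The sketch below is a variant of the router above under that assumption; `keyword_route`, `code_agent`, and `research_agent` are hypothetical stand-ins written for this example, and the keyword check is only a test double, not a suggested replacement for the model.

```python
from typing import Callable, Dict

Agent = Callable[[str], str]


def keyword_route(question: str) -> str:
    # Stand-in for the LLM classifier: a crude keyword check, for tests only.
    q = question.lower()
    return "code" if any(w in q for w in ("fix", "bug", ".py")) else "research"


def code_agent(q: str) -> str:
    return f"[code] {q}"


def research_agent(q: str) -> str:
    return f"[research] {q}"


def run(q: str, agents: Dict[str, Agent], route: Callable[[str], str]) -> str:
    # Exactly one specialist runs per turn; an unknown label fails loudly.
    label = route(q)
    if label not in agents:
        raise KeyError(f"router returned unknown label: {label!r}")
    return agents[label](q)


agents = {"code": code_agent, "research": research_agent}
print(run("fix the off-by-one in main.py", agents, keyword_route))
```

Injecting `route` keeps the one-specialist-per-turn behaviour testable in CI, where calling a hosted model on every run may be too slow or too costly.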

CHEATSHEET · BLOCK · 05

The 5 rules every 2026 multi-agent shipper knows

01. Start single-agent. Split only when the eval lift > the added cost.

02. Pick a pattern (router, supervisor, pipeline, swarm, debate, blackboard, graph) BEFORE coding.

03. Pydantic-typed messages between agents. Free text between agents is a smell.

04. Cap recursion depth and per-agent token budget. Hard-fail past the cap.

05. Eval the team end-to-end (MASEval / Braintrust), not each agent in isolation.
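
Rule 03 can be illustrated without any framework. The sketch below uses a stdlib dataclass with manual validation as a stand-in for a Pydantic model, so it runs with no extra dependencies; the `Handoff` schema, its fields, and the agent roster are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_AGENTS = {"router", "code", "research"}  # hypothetical roster


@dataclass(frozen=True)
class Handoff:
    sender: str
    recipient: str
    task: str
    depth: int

    def __post_init__(self):
        # Reject malformed messages at the boundary instead of downstream.
        if self.sender not in ALLOWED_AGENTS or self.recipient not in ALLOWED_AGENTS:
            raise ValueError("unknown agent name")
        if not isinstance(self.task, str) or not self.task.strip():
            raise ValueError("task must be a non-empty string")
        if not isinstance(self.depth, int) or self.depth < 0:
            raise ValueError("depth must be a non-negative integer")


def parse_handoff(raw: str) -> Handoff:
    """Parse a JSON string produced by one agent into a validated Handoff."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    try:
        return Handoff(**data)
    except TypeError as exc:  # missing or unexpected fields
        raise ValueError(f"bad handoff shape: {exc}") from exc


msg = parse_handoff(
    '{"sender": "router", "recipient": "code", "task": "fix main.py", "depth": 1}'
)
print(msg.recipient)
```

With a schema in place, a receiving agent never has to guess at the shape of its input, and a malformed message fails at the hand-off rather than several calls later. A real Pydantic model would express the same constraints more compactly.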

MINIGAME · RAPIDFIRETF · BLOCK · 06

Quick check — true or false?

Multi-agent always beats single-agent on cost.

CONCEPT · BLOCK · 07

What you'll ship in the full study

Eight lessons. Seven Docker projects. By the end you'll have:

- A LangGraph 1.0 supervisor team running locally with Langfuse traces.
- An OpenAI Agents SDK hand-off triage system you can paste into a real support repo.
- A Redis-backed blackboard pattern with structured Pydantic handoffs.
- A pytest team-level eval harness that catches deadlock and role drift in CI.
- A fully air-gapped local stack on gpt-oss-20b for regulated industries.

Every Docker project is meant to be lifted into your real work, not left as a demo.

INCLUDED: Each project ships with composeYaml, expectedStdout, and a 'lift to work' note explaining how to drop it into your team's repo.

LESSON COMPLETE · BLOCK · 08

That's the trailer.

NEXT: Lesson 1 · When to split into many agents

WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

Diagnose when single-agent is failing · Working
Use the 3-signal test (tool count, cost/latency split, trust boundaries) plus measured eval lift to decide IF multi-agent is worth the cost, before writing any code.
Pick a coordination pattern from 7 canonical options · Production
Router, supervisor (flat or hierarchical), pipeline, hand-off/swarm, debate, blackboard, graph: recognise each, know the trade-offs, ship the right one.
Build a LangGraph 1.0 supervisor team · Production
Typed StateGraph, conditional edges, durable checkpoints, the langgraph-supervisor library, and Langfuse integration end-to-end.
Implement OpenAI Agents SDK hand-offs · Production
Triage agent → specialist agents using the Agents SDK hand-off primitive, with guardrails, sandboxes, and tracing on by default.
Ship a Redis-backed blackboard pattern · Working
Pydantic schemas plus a Redis scratchpad with race-condition tests; safe parallel fan-out and fan-in.
Wire MCP and A2A for cross-stack interop · Working
An MCP server for tool sharing and A2A signed Agent Cards for cross-framework agent calls; pair them like REST + JWT.
Write team-level evals (MASEval / Braintrust) · Production
Golden traces and a regression suite gating CI. Metrics: context-reuse rate, contradictory-output rate, decision-sync time, p95 latency.
Defend against multi-agent failure modes · Advanced
Detect and prevent the 14 MAST failure modes, including deadlock, infinite loops, role drift, prompt-injection cascade, and recursion explosion.
Per-agent observability + cost attribution · Production
Tag every LLM call with (trace_id, agent_id, parent_agent_id, tool); build Langfuse / Phoenix / Weave dashboards over per-agent spend.
Run a fully air-gapped multi-agent stack · Advanced
gpt-oss-20b via Ollama + smolagents + Letta + Redis + Prom/Grafana: the deployment regulated industries actually buy.
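
The per-agent cost attribution skill in the list above amounts to tagging each call with the four keys it names and aggregating on them. A minimal sketch, assuming token counts are already available from the provider's response; `CallRecord`, `record_call`, and `tokens_by_agent` are hypothetical names and are not part of Langfuse, Phoenix, or Weave.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CallRecord:
    trace_id: str
    agent_id: str
    parent_agent_id: Optional[str]  # None for the entry-point agent
    tool: Optional[str]             # None when no tool was invoked
    tokens: int


LOG: List[CallRecord] = []  # in-memory stand-in for a tracing backend


def record_call(trace_id, agent_id, parent_agent_id, tool, tokens) -> None:
    LOG.append(CallRecord(trace_id, agent_id, parent_agent_id, tool, tokens))


def tokens_by_agent(trace_id: str) -> Dict[str, int]:
    """Sum token usage per agent for one trace."""
    totals: Dict[str, int] = defaultdict(int)
    for rec in LOG:
        if rec.trace_id == trace_id:
            totals[rec.agent_id] += rec.tokens
    return dict(totals)


record_call("t1", "router", None, None, 200)
record_call("t1", "code", "router", "run_tests", 3000)
record_call("t1", "code", "router", None, 1000)
print(tokens_by_agent("t1"))
```

Multiplying each total by the model's per-token price gives spend per agent; `parent_agent_id` lets the same log be rolled up along the call tree.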

Career & income delta

Career moves

Title yourself credibly as 'AI agent engineer' or 'agent platform engineer' — the 2026 hiring channel for senior IC roles at $200-400K.
Lead a multi-agent platform team — the AI infra team most series-B/C companies are now staffing.
Pick up contracting work at $200-400/hr fixing teams whose 'bag of agents' is amplifying errors 17×.
Ship the multi-agent feature your CTO has been asking about for 6 months — and own that line item on your perf review.

Income impact

$20-50K bump for senior ICs adding production multi-agent to their resume in 2026.
$50-150K bump moving from a generic backend role to an agent-platform team.
Freelance / consulting rates: $200-400/hr — 'we have a multi-agent system we can't ship' is the most common 2026 inquiry.
Enterprise demos / sales-engineering: closing one 6-figure deal per quarter often requires the team-eval harness in this course.

Market resilience

Multi-agent is the skill that survives the next foundation-model consolidation — orgs always need someone who knows HOW to ship them safely.
MCP and A2A are both now hosted under the Linux Foundation; protocol fluency is durable across model providers.
Observability + eval discipline carries forward to whatever the 2027 framework du jour is.
Air-gapped/on-prem deployment skills (gpt-oss + Ollama) remain in demand for any regulated industry, no matter the model market.