Based on several months of running real multi-agent systems. Every pattern was added because something broke when it was missing.

One AI agent is useful. So you add a second, then a third, and things start to drift out of control:

  • Agent A and Agent B edit the same file at the same time
  • One API goes down, three agents retry aggressively, and the bill explodes
  • A sub-agent finishes, disappears quietly, and nobody knows whether it succeeded or failed

The core problem is usually the collaboration layer, not the AI itself.

The five patterns below have been tested repeatedly in production-like workflows. Together, they let an AI team actually work as a team.


First, the Big Picture: What Does a Multi-Agent System Look Like?

A typical multi-agent system looks like this:

        🎯 Orchestrator

        ┌───────┼───────┬─────────┐
        ▼       ▼       ▼         ▼
   🔧 Executor A  🔧 Executor B  🔍 Reviewer   📁 Shared filesystem
        │       │           │
        └───────┴───────────┴──→ 📁 Shared filesystem

Key architecture traits:

| Trait | Description |
| --- | --- |
| Hierarchical control | One orchestrator coordinates everything |
| Stateless executors | Receive task, execute, report, exit |
| Files as communication | All state lives in files |
| No lateral conversation | Executors do not talk directly to each other |

Why this shape? Most AI tools currently provide stateless sub-agents, with no persistent memory and no built-in way to talk to other agents. That leaves the file system as the only reliable collaboration mechanism.


Pattern 1: File Blackboard

All agents communicate through files, not messages. The file system is the message bus.

Problem

Three agents need to collaborate, but they:

  • Have no memory: every invocation is fresh
  • Are isolated from each other: no built-in communication channel
  • Can disappear at any time: sessions can drop without warning

Solution: Treat the File System as a Blackboard

workspace/
  status.md          ← Project status (only the orchestrator can change this)
  tasks/
    task-001.md      ← Task envelope (instructions for the executor)
    task-002.md
  artifacts/
    task-001-out.md  ← Executor output
    task-002-out.md
  checkpoints/
    checkpoint.json  ← Crash recovery

Access Rules

The core rule is simple: only the orchestrator can write global state. Executors can write only to their assigned output locations.
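
The access rule above can be written as a small guard that runs before any write. This is a minimal sketch, not a finished permission system: the role name `orchestrator`, the function `can_write`, and the `artifacts/<task-id>-out.md` naming (taken from the directory layout above) are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

ORCHESTRATOR = "orchestrator"  # hypothetical role name


def can_write(role: str, path: str, task_id: Optional[str] = None) -> bool:
    """Return True if `role` may write `path` (relative to workspace/)."""
    if role == ORCHESTRATOR:
        return True  # only the orchestrator may change global state
    if task_id is None:
        return False  # an executor with no assigned task may write nothing
    # Executors may write only their own output file,
    # e.g. artifacts/task-001-out.md for task-001.
    allowed = PurePosixPath("artifacts") / f"{task_id}-out.md"
    return PurePosixPath(path) == allowed
```

An executor assigned `task-001` can write `artifacts/task-001-out.md` and nothing else; `status.md` stays orchestrator-only.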

Why Files?

| Alternative | Problem |
| --- | --- |
| Shared memory | Everything disappears when an agent crashes |
| Message queue | Overkill for a single-machine setup |
| Database | Most AI tools work more smoothly with text files |
| Agent-to-Agent API | Almost no platform supports this yet |

Files are useful because they are inspectable (open and read them), durable (survive session drops), and versionable (put them in git and you get history).
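
Durability only helps if a crash in the middle of a write cannot corrupt the file. One common way to get that for `checkpoints/checkpoint.json` is to write to a temporary file and rename it into place. The sketch below assumes a JSON-serializable state dict; `save_checkpoint` and `load_checkpoint` are hypothetical helper names, not part of any tool mentioned here.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write `state` so readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave half-written temp files behind
        raise


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or an empty dict on a fresh start."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

On restart, the orchestrator calls `load_checkpoint` first and resumes from whatever it finds.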


Pattern 2: Task Envelope

Do not just tell a sub-agent “do X.” Give it a structured task package.

Problem

Spawn a sub-agent and say “analyze this data,” and it may:

  • Analyze the wrong fields
  • Run 50 tool calls when 5 would have been enough
  • Finish with “analysis done” but not in the requested format

Solution: Package Every Task as an Envelope

## Task: analyze-q1-data
- **Goal**: Summarize Q1 revenue trends
- **Acceptance criteria**: Output includes trend direction, top 3 drivers, and confidence level
- **Input**: data/q1-revenue.csv (quarterly revenue by product line)
- **Output location**: artifacts/analyze-q1-data.md
- **Budget**: At most 5 tool calls
- **Stop condition**: If the data file is missing or malformed, stop and report back

What Belongs in the Envelope?

| Field | What happens without it |
| --- | --- |
| Goal | The agent guesses what you want |
| Acceptance criteria | Output quality is uncontrolled |
| Input | The agent may read the wrong file or scan the whole directory |
| Output location | Nobody knows where the result went |
| Budget | The agent explores forever and burns money |
| Stop condition | It gets stuck on errors and does not report back |
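
If the orchestrator generates envelopes programmatically, a small data class can refuse to emit one with a missing field. This is a sketch under assumptions: the class name `TaskEnvelope` and its attribute names are illustrative, and the output simply mirrors the markdown envelope shown earlier.

```python
from dataclasses import dataclass


@dataclass
class TaskEnvelope:
    task_id: str
    goal: str
    acceptance_criteria: str
    input_desc: str
    output_location: str
    budget: str
    stop_condition: str

    def to_markdown(self) -> str:
        """Render the envelope; raise ValueError if any field is blank."""
        fields = [
            ("Goal", self.goal),
            ("Acceptance criteria", self.acceptance_criteria),
            ("Input", self.input_desc),
            ("Output location", self.output_location),
            ("Budget", self.budget),
            ("Stop condition", self.stop_condition),
        ]
        lines = [f"## Task: {self.task_id}"]
        for label, value in fields:
            if not value.strip():
                raise ValueError(f"envelope field is empty: {label}")
            lines.append(f"- **{label}**: {value}")
        return "\n".join(lines) + "\n"
```

Writing the result to `tasks/<task-id>.md` gives the executor a single file to read before it starts.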

Lightweight vs Full Version

Not every task needs a full envelope. Simple rule:

  • Can finish within 3 steps → verbal instruction is enough
  • Touches external APIs or multiple files → use a lightweight envelope (goal + input + output)
  • High-risk or high-cost → full envelope, including budget and stop conditions
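
The rule of thumb above can be encoded as a tiny decision function. It is only a sketch: the parameter names are assumptions, and the last branch covers a case the rule does not mention (more than 3 steps, one file, no external API), where this version falls back to the lightweight envelope.

```python
from enum import Enum


class EnvelopeLevel(Enum):
    VERBAL = "verbal"
    LIGHTWEIGHT = "lightweight"  # goal + input + output
    FULL = "full"                # adds budget and stop conditions


def choose_envelope(
    steps: int,
    touches_external_api: bool,
    file_count: int,
    high_risk_or_cost: bool,
) -> EnvelopeLevel:
    """Pick the smallest envelope that still covers the task's risk."""
    if high_risk_or_cost:
        return EnvelopeLevel.FULL
    if touches_external_api or file_count > 1:
        return EnvelopeLevel.LIGHTWEIGHT
    if steps <= 3:
        return EnvelopeLevel.VERBAL
    # Assumption: the rule is silent here, so default to the lightweight form.
    return EnvelopeLevel.LIGHTWEIGHT
```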

Pattern 3: Circuit Breaker

Track consecutive failures per API. Trip after three, wait, then probe once.

Problem

An agent calls an API and hits 429 (rate limit). It retries, then retries again. A second agent is also retrying, so now two agents are hammering the same API and making the rate limit worse. A third agent switches to the backup provider, and that gets overwhelmed too.

This is a retry storm, one of the fastest ways to burn budget in a multi-agent system.

Solution: Three-State Circuit Breaker

| State | Behavior | Transition |
| --- | --- | --- |
| CLOSED (normal) | All calls proceed normally | 3 consecutive failures within 60 seconds → OPEN |
| OPEN (blocked) | Calls fail immediately without execution | Wait 5 minutes → HALF-OPEN |
| HALF-OPEN (probe) | Allow one call through | Success → CLOSED; failure → OPEN with doubled cooldown |
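
Here is one way the three states could look in code. Treat it as a sketch under assumptions: the class and method names are made up for illustration, state lives in memory for a single process, and the clock is injectable so the transitions can be tested without waiting five minutes. The thresholds (3 failures, 60 seconds, 5 minutes, doubled cooldown) come from the table above.

```python
import time
from typing import Callable, List


class CircuitBreaker:
    def __init__(self, threshold: int = 3, window: float = 60.0,
                 cooldown: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window = window
        self.base_cooldown = cooldown
        self.cooldown = cooldown
        self.clock = clock
        self.state = "CLOSED"
        self.failures: List[float] = []  # timestamps of consecutive failures
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self) -> bool:
        """Return True if the caller may make the API call now."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"  # cooldown elapsed: permit one probe
            self.probe_in_flight = False
        if self.state == "CLOSED":
            return True
        if self.state == "HALF_OPEN" and not self.probe_in_flight:
            self.probe_in_flight = True
            return True
        return False  # OPEN, or a probe is already running

    def record_success(self) -> None:
        if self.state == "OPEN":
            return  # late result from a call made before the trip
        self.state = "CLOSED"
        self.failures.clear()  # a success breaks the consecutive run
        self.cooldown = self.base_cooldown

    def record_failure(self) -> None:
        now = self.clock()
        if self.state == "OPEN":
            return
        if self.state == "HALF_OPEN":
            self.cooldown *= 2  # failed probe: back off for twice as long
            self._open(now)
            return
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "OPEN"
        self.opened_at = now
        self.failures.clear()
        self.probe_in_flight = False
```

Each agent calls `allow()` before hitting the API and reports the outcome afterwards. For several agents to share one breaker, this state would have to live somewhere they can all read, such as a file in the workspace.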

Which Errors Should Trip the Breaker?

Not all errors are equal:

| Error type | Trip breaker? | Reason |
| --- | --- | --- |
| 429 (rate limit) | Yes | Continuing only makes it worse |
| 503 (service unavailable) | Yes | The other side is down |
| Network timeout | Yes | Connectivity usually does not fix itself in one second |
| 401 (auth failure) | No | Configuration problem; fix the token instead of waiting |
| 400 (bad request) | No | A bug; retrying will not help |
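
Following the reasons in the table above, the decision can be reduced to a small classifier that runs before a failure is counted. This is a sketch: `should_trip` is a hypothetical helper, and treating every unlisted status code as non-tripping is an assumption you may want to tighten.

```python
from typing import Optional

# Waiting helps with these: the service is overloaded or down.
TRIPPING_STATUSES = {429, 503}


def should_trip(status: Optional[int] = None, timed_out: bool = False) -> bool:
    """Return True if this failure should count toward tripping the breaker."""
    if timed_out:
        return True  # network timeouts rarely clear up within a second
    # 401 and 400 fall through to False: fix the token or the bug instead.
    # Assumption: status codes not listed in the table do not trip either.
    return status in TRIPPING_STATUSES
```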

A Real Pitfall

Silent proxy failure: the API proxy went down, and every request returned the same generic error. Agents kept retrying because the error looked temporary.

Lesson: if the exact same error appears several times in a row, trip the breaker no matter what the status code is. Identical consecutive errors mean systemic failure.
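
A sketch of that lesson: count identical error messages in a row and signal once the count reaches a limit, regardless of status code. The class name `RepeatedErrorDetector` is hypothetical, and the default limit of 3 is an assumption borrowed from the breaker threshold, since "several times" is not pinned down here.

```python
from typing import Optional


class RepeatedErrorDetector:
    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self.last: Optional[str] = None
        self.count = 0

    def record_error(self, message: str) -> bool:
        """Return True once the same error has appeared `limit` times in a row."""
        key = message.strip()
        if key == self.last:
            self.count += 1
        else:
            self.last, self.count = key, 1  # a different error restarts the run
        return self.count >= self.limit

    def record_success(self) -> None:
        """Any success ends the run of identical errors."""
        self.last, self.count = None, 0
```

When `record_error` returns True, treat it like any other tripping failure and open the breaker.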


Pattern 4: HITL Escalation

Knowing when to ask a human is one of the most important capabilities of an AI system.

Three Escalation Levels

| Level | Risk | Agent behavior | Example |
| --- | --- | --- | --- |
| 🟢 Autonomous | Low, reversible | Do it directly | Read files, write drafts, run tests |
| 🟡 Notify | Medium, important milestone | Do it, then notify the human | Commit code, update status |
| 🔴 Gate | High, irreversible | Stop and wait for approval | Push to production, send messages, delete data |
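
The three levels map naturally onto a lookup the orchestrator can run before each action. The sketch below assumes every action has already been labeled with a risk (`low`, `medium`, `high`) and a reversibility flag; how you assign those labels is up to you, and the names here are illustrative.

```python
from enum import Enum


class Level(Enum):
    AUTONOMOUS = "autonomous"  # do it directly
    NOTIFY = "notify"          # do it, then tell the human
    GATE = "gate"              # stop and wait for approval


def escalation_level(risk: str, reversible: bool) -> Level:
    """Map a labeled action to one of the three escalation levels."""
    if risk not in {"low", "medium", "high"}:
        raise ValueError(f"unknown risk label: {risk!r}")
    if risk == "high" or not reversible:
        return Level.GATE  # anything irreversible waits for a human
    if risk == "medium":
        return Level.NOTIFY
    return Level.AUTONOMOUS
```

Treating every irreversible action as a gate, even a low-risk one, is the conservative choice in this sketch.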

Key Metric: Escalation Rate

The ideal escalation rate is 10-15%:

  • Too low (< 5%) → AI is doing high-risk things without notification. Dangerous.
  • Too high (> 30%) → the human becomes the bottleneck and automation loses meaning.
  • 10-15% → most work is automated, important work still has human oversight.
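
To watch this metric you need a definition. The sketch below assumes the rate is escalated actions divided by total actions, which is one plausible reading and not stated above. The band labels follow the thresholds in the list; the ranges the list leaves open (5-10% and 15-30%) get a neutral label.

```python
def escalation_rate(escalated: int, total: int) -> float:
    """Share of actions that involved a human (assumed definition)."""
    return escalated / total if total else 0.0


def assess(rate: float) -> str:
    """Label a rate using the thresholds from the list above."""
    if rate < 0.05:
        return "too low"
    if rate > 0.30:
        return "too high"
    if 0.10 <= rate <= 0.15:
        return "ideal"
    return "outside ideal band"  # 5-10% or 15-30%: not covered by the list
```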

Information Format for Escalation

When an agent escalates to a human, do not just say “uncertain.” Provide the full decision packet:

## Decision Needed

**Context**: Deploying a new version to production
**Issue**: All tests pass, but one warning is related to memory usage
**Options**:
  A. Deploy as usual (the warning has not caused problems historically)
  B. Fix the warning before deployment (estimated delay: 2 hours)
  C. Deploy to staging and observe for one day

**Recommendation**: Choose A, because [reason]
**Waiting for reply...**

Pattern 5: Model Selection

Use expensive models for thinking and cheaper models for execution. Reverse that and you waste money.

The Binary Rule

The core cost-control rule for multi-agent systems:

  • Needs judgment (strategy / analysis / review) → expensive model
  • Does not need judgment (writing code / formatting / scraping data) → cheaper model

Common Setup

| Role | Model tier | Reason |
| --- | --- | --- |
| Orchestrator | Expensive | It makes all the important decisions; economizing here costs you everywhere downstream |
| Code writer | Cheap | Output can be validated with tests |
| Code reviewer | Expensive | Needs to catch what others missed |
| Data collector | Cheap | Mechanical scraping + formatting |
| Data analyst | Expensive | Interpretation, pattern finding, judgment |
| Test runner | Cheap | Result is pass/fail; no creativity needed |
| Format converter | Cheapest | Pure data transformation |
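
In code, the setup above is just a lookup table with the binary rule as a fallback. This is a sketch: the role keys and the function `tier_for` are illustrative names, and the tiers are labels you would map to whatever models you actually use.

```python
# Tiers from the table above; role keys are illustrative.
MODEL_TIER = {
    "orchestrator": "expensive",
    "code_writer": "cheap",
    "code_reviewer": "expensive",
    "data_collector": "cheap",
    "data_analyst": "expensive",
    "test_runner": "cheap",
    "format_converter": "cheapest",
}


def tier_for(role: str, needs_judgment: bool = False) -> str:
    """Return the model tier for a role, falling back to the binary rule."""
    if role in MODEL_TIER:
        return MODEL_TIER[role]
    # Unknown role: judgment work gets the expensive model, the rest does not.
    return "expensive" if needs_judgment else "cheap"
```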

80/20 Cost Rule

Four changes save most multi-agent cost:

  1. Model tiering (expensive thinks, cheap executes) → biggest impact
  2. Pass paths, not content (let the agent read files itself) → saves tokens
  3. Compress completed phases (keep one-sentence summaries) → saves tokens
  4. Set budget caps in task envelopes → avoids outliers

Summary: How the 5 Patterns Relate

These patterns are not independent. They form one operating system:

| Pattern | Problem solved | Without it |
| --- | --- | --- |
| File Blackboard | How agents communicate | State conflicts, lost data |
| Task Envelope | How tasks are assigned clearly | Uncontrolled quality, runaway budget |
| Circuit Breaker | What to do when APIs fail | Retry storms, exploding bills |
| HITL Escalation | When to ask a human | High-risk actions without oversight |
| Model Selection | Which model to use | All-expensive wastes money; all-cheap collapses quality |

Getting Started

Minimum viable order:

  1. Create a shared directory (workspace/status.md + tasks/ + artifacts/)
  2. Package the first task in an envelope (not the full version; goal + input + output is enough)
  3. Add a circuit breaker (track API failures in a JSON file)
  4. Decide which actions need human approval (start with anything affecting external systems)

After these four steps, you have a working multi-agent collaboration system.
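
Step 1 takes only a few lines. The sketch below creates the directory layout from Pattern 1; `init_workspace` is a hypothetical helper and the initial contents of `status.md` are a placeholder.

```python
from pathlib import Path


def init_workspace(root: Path) -> None:
    """Create the blackboard layout; safe to run more than once."""
    for sub in ("tasks", "artifacts", "checkpoints"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    status = root / "status.md"
    if not status.exists():
        # Placeholder content; only the orchestrator should edit this file.
        status.write_text("# Project status\n", encoding="utf-8")
```

Calling `init_workspace(Path("workspace"))` a second time leaves an existing `status.md` untouched.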

For the full pattern docs, code templates, and more field examples, I packaged the material as an open-source repo:

👉 Orchestration Playbook on GitHub

MIT licensed. Use it freely.

Penchan’s Take

The Opus / Sonnet / ChatGPT three-agent setup on OpenClaw grew these five patterns one failure at a time. In the early versions, two agents edited the same file and the state got messy; when an API failed, three agents retried together and the bill jumped immediately. After shared state moved into the file system, tasks became envelopes, APIs got circuit breakers, and important operations went through a human gate, the whole system stabilized. Model tiering came later, using expensive models for command and cheaper ones for execution, and saved more tokens than expected.

FAQ

Q: What is AI Agent Orchestration?

Orchestration means a primary agent, the orchestrator, assigns tasks, collects results, and handles failures while other agents focus on their assigned work. Like a conductor, it makes sure every player performs the right part at the right time.

Q: Why can’t AI agents just talk to each other directly?

Most current AI tools, including Claude Code and Codex, create stateless sub-agents. They are spawned, run a task, report a result, and disappear. They have no persistent memory or built-in communication channel, so file-system based communication is the reliable workaround.

Q: What does a Circuit Breaker do in an AI system?

A Circuit Breaker tracks consecutive API failures. When failures reach a threshold, it stops calls for a while so all agents do not retry at the same time. This prevents retry storms, one of the most expensive failure modes in multi-agent systems.


— Penchan