After an AI Agent has been online for a while, you may see this situation: automated schedules are running, memory is updating, and Sub-agents quietly handle dozens of tasks every day. Everything looks stable.
Until one day you open the dashboard and discover a Sub-agent has been stuck for three hours. A user asked a question on Telegram. The Agent replied, “Let me check,” and then vanished. No alert, no notification. Someone just happened to notice.
The hardest part of an Agent failing is this: a crash nobody notices is harder to handle than the crash itself.
An even worse scenario: you set up an LLM monitor to see whether Agents are stuck, but the monitor starts hallucinating nonexistent tool calls, contributes most of the day’s error logs, and buries the real issue even deeper.
The three-layer self-healing architecture below is the version that remained after stepping on those traps.
1. Why Traditional Monitoring Misses Agent Problems
Traditional monitoring asks: “Is the process alive?”
AI Agents need a different question: “Is the Agent doing what it said it would do?”
This is a fundamental difference. AI Agents have four failure modes traditional monitoring cannot see:
- Silent stall: the process is running and HTTP responds, but the Agent has had no output for 20 minutes
- Unfulfilled promise: the Agent says “give me five minutes” and never comes back
- Monitoring backfire: the health-check LLM hallucinates nonexistent commands and becomes the source of errors
- Orphaned task: the session dies halfway, and nobody knows the task was interrupted
These situations need a layered monitoring system that understands semantic Agent behavior, not just a more detailed process monitor.
2. Three-Layer Protection: 5 Minutes → 15 Minutes → 60 Minutes
Use a three-layer architecture with roughly 3x frequency gaps. The lower the layer, the cheaper and more frequent it runs:
| Layer | Frequency | Tech | Cost | Detects |
|---|---|---|---|---|
| L1 | Every 5 minutes | Shell + Node.js | 0 | process liveness, HTTP, heartbeat liveness, promise detection |
| L2 | Every 15 minutes | Cheap LLM | very low | session alive/dead, stuck Sub-agents, abort detection |
| L3 | Every 60 minutes | Shell → LLM on demand | very low | orphan checkpoints, context overflow, log analysis |

Why not use a single approach:
| Approach | Result |
|---|---|
| Pure LLM (Gemini Flash) | very high error rate; hallucinated tool calls drown out real problems |
| Pure LLM (Haiku API) | too expensive; works, but not sustainable long-term |
| Pure Shell | cannot catch semantic stalls or unfulfilled promises |
| Shell + Cheap LLM + on-demand escalation | very low cost, each layer does what it is good at |
L1 also monitors L2. If the LLM Heartbeat has not produced HEARTBEAT_OK for more than 20 minutes, the L1 Shell script alerts. Monitoring systems need someone to monitor the monitoring system, and that “someone” must be zero-cost; otherwise nobody catches L2 when it dies.
3. Prohibition-First: Stop the Monitoring LLM from Making Things Worse
The painful lesson: give an LLM a monitoring task plus a pile of tools, and it will hallucinate nonexistent tool calls.
The first Heartbeat version used gemini-flash and produced errors like:
canvas failed: node required ×N
message failed: chat not found ×N(拿 Discord ID 當 Telegram chat_id)
exec failed: command not found: rss-tool ×N
edit failed: Missing required parameter ×N
Worse, one Sub-agent read maintenance script output that said “Restart gateway to apply changes,” and it actually ran gateway restart, causing an unexpected full-system restart.
The solution is Prohibition-First Prompt Design:
## 你只能用這些工具(白名單制)
✅ sessions_list:查看 session
✅ sessions_send:發通知
✅ subagents kill:終止卡住的 agent
✅ message:僅限通知頻道
❌ 其他所有工具都禁止:exec/read/edit/web_search/gateway 等。
---
## 一切正常時,只回覆:
HEARTBEAT_OK
Key design: put the allowlist before the task description. LLMs process prompts in order. If they read the task before the limits, they start planning before the constraints arrive. If they read the limits first, their planning space is constrained from the beginning.
This change brought Heartbeat errors close to zero.
4. Promise Detection: The Agent Makes a Promise and Disappears
This is the most distinctive part of the system.
Traditional monitoring does not read the Agent’s conversation. In practice, though, the most common “fake alive” pattern is: the Agent promises something, then goes silent.

promise-watchdog, a zero-dependency Node.js script, reads Agent transcripts and uses regex to detect promise phrases:
const promisePattern = /(
give me \d+ minutes?|be right back|let me check|
稍等|等等|給幾分鐘|現在就去|稍後回覆
)/i;
Detection logic:
- Pending-reply detection: if the last message is from the user and the Agent has not replied for more than 6 minutes → alert
- Unfulfilled promise: if the Agent’s last message matches the promise pattern and no new progress appears for more than 7 minutes → alert
The notification includes diagnostic context: “Gateway restarted 1 time during this period,” “no active Sub-agent currently visible.” In seconds, you can judge whether the Agent crashed, is busy, or simply forgot.
Repeated notifications are deduplicated with SHA1 signatures, so the same stall does not alert repeatedly within 20 minutes.
5. Checkpoint: Interrupted Tasks Do Not Have to Restart from Zero
Sessions die. That is a fact. API timeout, context overflow, process restart: there are many causes. The important question is: what happens after it dies?
Dual-track recovery strategy:
When a Checkpoint exists:
Before spawning a Sub-agent, the Orchestrator writes a checkpoint file containing the task name, current step, and completed steps. After a session dies, L3 scans hourly and finds the orphan checkpoint. If it qualifies, meaning recent enough, retry count not exceeded, and no human input required, it automatically spawns a new session to continue.
When no Checkpoint exists:
Step back and read the conversation history. Find the user’s last message, the original request, then respawn with partial results already present in the thread. This is the fallback, so even if nothing was prepared, the system can still recover itself.
Restart counting is stateless. Instead of using an external file to record “how many restarts happened,” count the 🏥 recovery markers in the thread. The conversation itself is the state store. Allow at most two automatic restarts; beyond that, notify a human.
Summary: Design Principles for Self-Healing Systems
- Cost is architecture: layering is a budget strategy; do not use an LLM for what Shell can do
- Prohibition first: in monitoring prompts for LLMs, write what they cannot do before what they should do
- No news is the goal: zero-noise principle. Most runs should return
HEARTBEAT_OK - Cheaper layers run more often: 5m → 15m → 60m with 3x gaps lets the zero-cost layer find issues first
- Monitor the monitoring system: L1 monitors L2 liveness. Monitoring itself can fail, and the layer catching it must be zero-cost
The most expensive LLM should wake up only when real judgment is needed. The rest of the time, Shell scripts and regex are enough.
Further Reading
- Multi-Agent Collaboration Guide
- AI Workspace Auto-Cleanup
- OpenClaw Multi-Agent Architecture
- Security Risks of AI Agents
Penchan’s Experience
My main multi-agent setup is OpenClaw with ChatGPT and Claude split across agents. This self-healing approach came from several rounds of stuck sub-agents and heartbeat monitors hallucinating tool calls. The most noticeable lesson is “monitor the monitoring system”: the first LLM heartbeat started calling random tools, and it really polluted the day’s error logs. In the end, shell + regex was what kept it under control.
FAQ
Q: Why does an AI Agent need self-healing?
AI Agents fail differently from traditional programs. They can be “alive but stuck.” The process runs and HTTP responds, but the Agent says “give me five minutes” and never returns. Traditional monitoring only checks process liveness and cannot catch this semantic stall.
Q: How much does a three-layer self-healing architecture cost?
It can be extremely low. The bottom layer is pure Shell scripts and free; the middle layer uses cheap models such as Gemini Flash; the top layer only runs hourly, and most issues are handled by Shell before any LLM call is needed.
Q: Can I use this without Claude Code?
Yes. The architecture is general. As long as your Agent system can output transcripts and provide a way to query session status, you can adapt it.
— Penchan