AI Agent Self-Healing: A Three-Layer Architecture So Your Agent Stops Dying Quietly

After an AI Agent has been online for a while, you may see this situation: automated schedules are running, memory is updating, and Sub-agents quietly handle dozens of tasks every day. Everything looks stable.

Until one day you open the dashboard and discover a Sub-agent has been stuck for three hours. A user asked a question on Telegram. The Agent replied, “Let me check,” and then vanished. No alert, no notification. Someone just happened to notice.

The hardest part of an Agent failing is this: a crash nobody notices is harder to handle than the crash itself.

An even worse scenario: you set up an LLM monitor to see whether Agents are stuck, but the monitor starts hallucinating nonexistent tool calls, contributes most of the day’s error logs, and buries the real issue even deeper.

The three-layer self-healing architecture below is the version that remained after stepping on those traps.

1. Why Traditional Monitoring Misses Agent Problems

Traditional monitoring asks: “Is the process alive?”

AI Agents need a different question: “Is the Agent doing what it said it would do?”

This is a fundamental difference. AI Agents have four failure modes traditional monitoring cannot see:

Silent stall: the process is running and HTTP responds, but the Agent has had no output for 20 minutes
Unfulfilled promise: the Agent says “give me five minutes” and never comes back
Monitoring backfire: the health-check LLM hallucinates nonexistent commands and becomes the source of errors
Orphaned task: the session dies halfway, and nobody knows the task was interrupted

These situations need a layered monitoring system that understands semantic Agent behavior, not just a more detailed process monitor.

2. Three-Layer Protection: 5 Minutes → 15 Minutes → 60 Minutes

Use a three-layer architecture with roughly 3x frequency gaps. The lower the layer, the cheaper and more frequent it runs:

Layer	Frequency	Tech	Cost	Detects
L1	Every 5 minutes	Shell + Node.js	0	process liveness, HTTP, heartbeat liveness, promise detection
L2	Every 15 minutes	Cheap LLM	very low	session alive/dead, stuck Sub-agents, abort detection
L3	Every 60 minutes	Shell → LLM on demand	very low	orphan checkpoints, context overflow, log analysis

Three-layer self-healing architecture

Why not use a single approach:

Approach	Result
Pure LLM (Gemini Flash)	very high error rate; hallucinated tool calls drown out real problems
Pure LLM (Haiku API)	too expensive; works, but not sustainable long-term
Pure Shell	cannot catch semantic stalls or unfulfilled promises
Shell + Cheap LLM + on-demand escalation	very low cost, each layer does what it is good at

L1 also monitors L2. If the LLM Heartbeat has not produced HEARTBEAT_OK for more than 20 minutes, the L1 Shell script alerts. Monitoring systems need someone to monitor the monitoring system, and that “someone” must be zero-cost; otherwise nobody catches L2 when it dies.

3. Prohibition-First: Stop the Monitoring LLM from Making Things Worse

The painful lesson: give an LLM a monitoring task plus a pile of tools, and it will hallucinate nonexistent tool calls.

The first Heartbeat version used gemini-flash and produced errors like:

canvas failed: node required              ×N
message failed: chat not found            ×N（拿 Discord ID 當 Telegram chat_id）
exec failed: command not found: rss-tool  ×N
edit failed: Missing required parameter   ×N

Worse, one Sub-agent read maintenance script output that said “Restart gateway to apply changes,” and it actually ran gateway restart, causing an unexpected full-system restart.

The solution is Prohibition-First Prompt Design:

## 你只能用這些工具（白名單制）

✅ sessions_list：查看 session
✅ sessions_send：發通知
✅ subagents kill：終止卡住的 agent
✅ message：僅限通知頻道

❌ 其他所有工具都禁止：exec／read／edit／web_search／gateway 等。

---

## 一切正常時，只回覆：
HEARTBEAT_OK

Key design: put the allowlist before the task description. LLMs process prompts in order. If they read the task before the limits, they start planning before the constraints arrive. If they read the limits first, their planning space is constrained from the beginning.

This change brought Heartbeat errors close to zero.

4. Promise Detection: The Agent Makes a Promise and Disappears

This is the most distinctive part of the system.

Traditional monitoring does not read the Agent’s conversation. In practice, though, the most common “fake alive” pattern is: the Agent promises something, then goes silent.

Promise detection flow

promise-watchdog, a zero-dependency Node.js script, reads Agent transcripts and uses regex to detect promise phrases:

const promisePattern = /(
  give me \d+ minutes?|be right back|let me check|
  稍等|等等|給幾分鐘|現在就去|稍後回覆
)/i;

Detection logic:

Pending-reply detection: if the last message is from the user and the Agent has not replied for more than 6 minutes → alert
Unfulfilled promise: if the Agent’s last message matches the promise pattern and no new progress appears for more than 7 minutes → alert

The notification includes diagnostic context: “Gateway restarted 1 time during this period,” “no active Sub-agent currently visible.” In seconds, you can judge whether the Agent crashed, is busy, or simply forgot.

Repeated notifications are deduplicated with SHA1 signatures, so the same stall does not alert repeatedly within 20 minutes.

5. Checkpoint: Interrupted Tasks Do Not Have to Restart from Zero

Sessions die. That is a fact. API timeout, context overflow, process restart: there are many causes. The important question is: what happens after it dies?

Dual-track recovery strategy:

When a Checkpoint exists:

Before spawning a Sub-agent, the Orchestrator writes a checkpoint file containing the task name, current step, and completed steps. After a session dies, L3 scans hourly and finds the orphan checkpoint. If it qualifies, meaning recent enough, retry count not exceeded, and no human input required, it automatically spawns a new session to continue.

When no Checkpoint exists:

Step back and read the conversation history. Find the user’s last message, the original request, then respawn with partial results already present in the thread. This is the fallback, so even if nothing was prepared, the system can still recover itself.

Restart counting is stateless. Instead of using an external file to record “how many restarts happened,” count the 🏥 recovery markers in the thread. The conversation itself is the state store. Allow at most two automatic restarts; beyond that, notify a human.

Summary: Design Principles for Self-Healing Systems

Cost is architecture: layering is a budget strategy; do not use an LLM for what Shell can do
Prohibition first: in monitoring prompts for LLMs, write what they cannot do before what they should do
No news is the goal: zero-noise principle. Most runs should return HEARTBEAT_OK
Cheaper layers run more often: 5m → 15m → 60m with 3x gaps lets the zero-cost layer find issues first
Monitor the monitoring system: L1 monitors L2 liveness. Monitoring itself can fail, and the layer catching it must be zero-cost

The most expensive LLM should wake up only when real judgment is needed. The rest of the time, Shell scripts and regex are enough.

Penchan’s Experience

My main multi-agent setup is OpenClaw with ChatGPT and Claude split across agents. This self-healing approach came from several rounds of stuck sub-agents and heartbeat monitors hallucinating tool calls. The most noticeable lesson is “monitor the monitoring system”: the first LLM heartbeat started calling random tools, and it really polluted the day’s error logs. In the end, shell + regex was what kept it under control.

FAQ

Q: Why does an AI Agent need self-healing?

AI Agents fail differently from traditional programs. They can be “alive but stuck.” The process runs and HTTP responds, but the Agent says “give me five minutes” and never returns. Traditional monitoring only checks process liveness and cannot catch this semantic stall.

Q: How much does a three-layer self-healing architecture cost?

It can be extremely low. The bottom layer is pure Shell scripts and free; the middle layer uses cheap models such as Gemini Flash; the top layer only runs hourly, and most issues are handled by Shell before any LLM call is needed.

Q: Can I use this without Claude Code?

Yes. The architecture is general. As long as your Agent system can output transcripts and provide a way to query session status, you can adapt it.

— Penchan