The previous article covered a lower-level question: how multiple agents collaborate without burning cost on communication. But once the foundation is stable, quality does not automatically follow.

A common development story: you ask one agent to write a patch, then ask it to review its own patch. It replies “looks good.” Then the tests run, and everything breaks. Smooth collaboration and production-ready output are still separated by a whole layer of method.

This article is about that layer. The point is not whether you can ask many agents to work together. The point is whether each step actually brings in new information. The three patterns below are supported by research, and the premises are simple. Remove any one of them, and later review easily turns into theater.

Pattern 6: Challenge Loop

The most common fake review looks like this: ask Agent B to review Agent A’s code, and it returns a pile of professional-sounding comments, such as “consider error handling” or “there may be risk here.” The sentences look like real review, but the information density is low. After two rounds of fixes, the bug is still there, because nobody ever produced testable evidence.

A more stable rule: every challenge must include evidence. Most often this means a failing test or a reproducible counterexample; sometimes pointing to the exact spec clause that is not satisfied is enough. If the reviewer cannot provide any of these, treat the challenge as noise. This sounds harsh, but the effect is very different: review moves from “this feels weird” to “this input fails, and this acceptance criterion is unmet.”
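The evidence rule can be enforced mechanically before a human or a decision-making agent reads the review. The sketch below is a minimal illustration, not the author's tooling: the `Challenge` type, its field names, and `filter_challenges` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Challenge:
    """One reviewer objection. All field names are illustrative assumptions."""
    claim: str
    failing_test: Optional[str] = None    # id or body of a test that fails
    counterexample: Optional[str] = None  # a reproducible input
    spec_clause: Optional[str] = None     # the exact clause that is unmet

    def has_evidence(self) -> bool:
        # A challenge counts only if at least one evidence field is non-blank.
        fields = (self.failing_test, self.counterexample, self.spec_clause)
        return any(f is not None and f.strip() != "" for f in fields)


def filter_challenges(challenges: list[Challenge]) -> tuple[list[Challenge], list[Challenge]]:
    """Split challenges into (accepted, noise) by whether they carry evidence."""
    accepted = [c for c in challenges if c.has_evidence()]
    noise = [c for c in challenges if not c.has_evidence()]
    return accepted, noise


accepted, noise = filter_challenges([
    Challenge(claim="consider error handling"),
    Challenge(claim="retry path is uncapped", spec_clause="hypothetical clause AC-3"),
])
print(len(accepted), len(noise))  # prints: 1 1
```

Only the accepted list goes forward to the fix step; the noise list is still worth logging, because its size says something about the review prompt.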

A 2025 study that specifically measured AI sycophancy found sycophantic behavior in 58% of LLM responses. Plainly: models easily go along with the user, and just as easily go along with the previous model. If review is not deliberately adversarial, it slides into a rubber stamp. “Help me review this” is too empty. Ask for “3 places that can be proven to fail” or “only return items that violate the spec.”

One number worth tracking: the percentage of challenges that lead to real changes. If a review round raises 10 points and only 1 changes code or spec, the hit rate is 10%, well under 20%, and that round is mostly performance. The problem is usually not that the reviewer is dumb; the prompt pushed it to generate criticism-shaped filler.
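The hit rate is simple arithmetic, but writing it down makes it easy to log per round. This is a sketch with hypothetical function names; the 20% cutoff is taken from the rule of thumb above and is an assumption, not a measured threshold.

```python
def challenge_hit_rate(raised: int, led_to_change: int) -> float:
    """Fraction of challenges in a round that changed code or spec."""
    if raised < 0 or led_to_change < 0 or led_to_change > raised:
        raise ValueError("counts must satisfy 0 <= led_to_change <= raised")
    if raised == 0:
        return 0.0  # an empty round has no hits
    return led_to_change / raised


def is_mostly_performance(raised: int, led_to_change: int, cutoff: float = 0.20) -> bool:
    """Flag a round whose hit rate falls below the (assumed) cutoff."""
    return challenge_hit_rate(raised, led_to_change) < cutoff


print(challenge_hit_rate(10, 1))     # prints: 0.1
print(is_mostly_performance(10, 1))  # prints: True
```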

This pattern is especially useful in spec review. One financial algorithm spec went through a 6-round challenge loop. R1 immediately found 3 real issues: data leakage, a missing slippage model, and an uncapped retry path. R2 added a bug in the data split purge window. By round 6, effective challenges fell to 0 and external tests were all green. Running 6 rounds is not what makes it rigorous; the key is that every round must provide evidence, so the decision maker knows whether to change anything.

Challenge loops also need convergence limits. Normally, the main problems should appear in the first 2-3 rounds. Past that, a common failure mode is that each round gets longer and drifts further. Research classifies this kind of overthinking loop as a high-frequency failure mode. Longer rounds make everyone look busy while information density drops.
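A convergence limit can be as plain as a round cap plus an early stop when a round produces no evidence-backed challenges. The sketch below assumes a hypothetical `run_round` callable that returns the number of effective challenges for a round; the default cap of 3 follows the 2-3 round guideline above.

```python
from typing import Callable


def run_challenge_loop(run_round: Callable[[int], int], max_rounds: int = 3) -> list[int]:
    """Run rounds until one finds nothing or the cap is hit.

    Returns the number of effective challenges found in each round.
    """
    history: list[int] = []
    for round_no in range(1, max_rounds + 1):
        effective = run_round(round_no)
        history.append(effective)
        if effective == 0:
            break  # converged: nothing new with evidence
    return history


# Made-up counts for illustration only: 3 issues, then 1, then nothing.
fake_counts = {1: 3, 2: 1}
print(run_challenge_loop(lambda r: fake_counts.get(r, 0), max_rounds=6))  # prints: [3, 1, 0]
```

If the loop hits the cap without reaching zero, that is a signal to look at the spec or the review prompt rather than to raise the cap.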

[Figure: Challenge Loop workflow. Each challenge must include a failing test or a spec violation.]

Pattern 7: Cross-Family Review

Model self-review can almost be skipped.

The reason is simple. Right after a model writes something, that reasoning path is still warm. It follows its own previous logic and finds its own output more reasonable than it really is. The research is direct too: Self-Correction Bench measured a 64.5% blind-spot rate, and another 2026 study found that models are roughly 5 times more likely to be lenient when reviewing their own output, including risky content.

Spend review budget on independence, not ritual. The most practical move is cross-family review. Let Claude review what ChatGPT wrote; let ChatGPT review what Claude wrote. The reason is pragmatic: a 2025 analysis of more than 350 models found error correlation within the same model family is clearly higher than across families. To catch blind spots, the reviewer needs to bring different priors into the room.
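In an orchestration script, cross-family review comes down to one check: the reviewer's family must differ from the author's. The sketch below is illustrative only. The keys in `MODEL_FAMILY` are informal labels, not real API model identifiers, and `pick_reviewer` is a hypothetical helper.

```python
# Informal labels mapped to vendor family; not real API model ids.
MODEL_FAMILY = {
    "claude-opus": "anthropic",
    "claude-sonnet": "anthropic",
    "chatgpt": "openai",
    "codex": "openai",
}


def pick_reviewer(author: str, candidates: list[str]) -> str:
    """Return the first candidate from a different family than the author."""
    author_family = MODEL_FAMILY[author]
    for candidate in candidates:
        if MODEL_FAMILY[candidate] != author_family:
            return candidate
    raise LookupError(f"no cross-family reviewer available for {author}")


print(pick_reviewer("codex", ["chatgpt", "claude-sonnet"]))  # prints: claude-sonnet
```

Raising an error instead of quietly falling back to a same-family reviewer is deliberate: a same-family review should be a visible decision, not a default.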

Do not mythologize it. Output similarity among frontier models is still high, with an ICML 2025 study observing values close to 90% at the top end. Cross-family review improves hit rate, but it does not grant omniscience. Many blind spots are shared by all models: vague specs, weak tests, or missing background knowledge. Switching reviewers can still miss those.

The anti-pattern to watch is “Validation Retreat.” When a repair loop gets stuck, an agent often changes the tests, makes the screen green, and pretends the issue is solved. It looks successful on the surface, but the benchmark was loosened. When reviewing a fix, look back: did this round touch tests? If yes, did the spec authorize it? Did it change the acceptance standard, or fix the actual error?
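The first of those questions, “did this round touch tests?”, is easy to automate from the list of changed files. The sketch below assumes a common Python layout (a `tests` directory, or files named `test_*.py` / `*_test.py`); both the naming convention and the function names are assumptions to adapt to your repository.

```python
from pathlib import PurePosixPath


def is_test_path(path: str) -> bool:
    """Heuristic: does this path look like a test file? (assumed convention)"""
    p = PurePosixPath(path)
    return (
        "tests" in p.parts
        or p.name.startswith("test_")
        or p.name.endswith("_test.py")
    )


def validation_retreat_suspects(changed_files: list[str],
                                spec_allows_test_changes: bool = False) -> list[str]:
    """Return test files changed in a fix round without spec authorization."""
    if spec_allows_test_changes:
        return []
    return [f for f in changed_files if is_test_path(f)]


changed = ["src/regime.py", "tests/test_regime.py"]
print(validation_retreat_suspects(changed))  # prints: ['tests/test_regime.py']
```

A non-empty result is not proof of a retreat. It only marks the files a human or reviewer agent should compare against the acceptance criteria.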

Field example: Codex wrote a patch, and self-review passed almost immediately. Then Claude Sonnet reviewed it line by line against the acceptance criteria and immediately found 3 critical issues, all hidden in old paths. Codex’s attention was on the new logic it had just changed, so it naturally missed those places. A different family really does bring a different view.

The review team does not need to grow forever. Research and practice both point in the same direction: two reviewers are near the sweet spot. Add more, and marginal benefit falls quickly while coordination cost keeps rising. For ordinary code changes, one writing agent, one cross-family reviewer, plus clear spec and tests is more stable than piling on three or four reviewers that copy each other.

[Figure: Cross-Family Review. GPT produces code and Claude reviews it against the acceptance criteria; the 64.5% self-review blind-spot rate is called out.]

Pattern 8: Spec-Driven Development

Many people think bugs come from weak review. In practice, the ceiling is usually the spec.

If the initial task handed to an agent is vague, then no matter how many review rounds follow, the writer and reviewer are both feeling around inside the same fog. AI review without an executable spec is structurally circular: both sides reason from the same vague text. If they guess right, great. If they guess wrong, they are wrong together.

There are numbers behind this. In the ClarifyGPT study, adding spec refinement raised Pass@1 from 70.96% to 80.80%, a gain of 9.84 percentage points, or 13.87% in relative terms. Many repair loops look like code fixes, but they are actually filling in things the spec should have made explicit.
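The 13.87% figure is a relative improvement, not a difference in percentage points. A quick check of the arithmetic, using only the two Pass@1 values quoted above (the function name is just for illustration):

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


before, after = 70.96, 80.80
print(round(after - before, 2))                        # prints: 9.84
print(round(relative_improvement(before, after), 2))   # prints: 13.87
```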

For complex handoffs, I usually include these blocks:

Objective
Context
Acceptance Criteria
Interface Contract
Anti-patterns
Review Checklist

If the task touches data or backtesting, add data integrity requirements too: no look-ahead, realistic transaction costs, explicit data split method. These fields are annoying to write, but they block many problems that would otherwise explode later.
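The handoff blocks above map naturally onto a small typed record, which lets an orchestrator refuse a complex handoff that leaves blocks empty. This is a sketch under assumptions: the `Handoff` class, its field names, and `missing_blocks` are hypothetical, and treating all six blocks as required applies only to complex tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Structured handoff for a complex task. Field names are illustrative."""
    objective: str
    context: str = ""
    acceptance_criteria: list[str] = field(default_factory=list)
    interface_contract: str = ""
    anti_patterns: list[str] = field(default_factory=list)
    review_checklist: list[str] = field(default_factory=list)
    # Optional block for tasks that touch data or backtesting.
    data_integrity: list[str] = field(default_factory=list)

    def missing_blocks(self) -> list[str]:
        """Names of the six core blocks that are still empty."""
        core = [
            "objective", "context", "acceptance_criteria",
            "interface_contract", "anti_patterns", "review_checklist",
        ]
        return [name for name in core if not getattr(self, name)]


old_style = Handoff(
    objective="Change the regime detector to use absolute APR instead of relative."
)
print(old_style.missing_blocks())
```

Run on a one-line spec, the check lists every block except the objective, which is exactly the ambiguity a reviewer would otherwise have to guess at.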

Before/after example:

  • Old spec: “Change the regime detector to use absolute APR instead of relative.”
  • New spec: at minimum, write one clear objective, several acceptance criteria, data integrity constraints, and explicitly forbidden approaches.

This is tedious, but the effect is obvious. Codex’s first submission is usually much closer to the target, and repair loops often shrink from an average of 3 rounds to 1-2.

Specs are not better just because they are thicker. For simple tasks, a verbal instruction plus 1-2 acceptance criteria is often enough. If a medium-complexity task is forced through a full template, the process slows down noticeably. The principle is: write only enough to remove ambiguity. More is not better. The spec’s job is convergence, not dumping every thought in your head onto the next agent.

[Figure: Spec-Driven Development. Structured Handoff Schema template, highlighting the objective, acceptance criteria, and review checklist.]

Bonus: Connecting the Three Patterns

The most common development pipeline is short:

Spec (Structured Handoff)
  → Challenge Loop (find spec gaps)
  → Implementation
  → Cross-Family Review (find code gaps)
  → Fixes
  → Done

The core principle is one sentence: every step must introduce new information. If two steps are just the same model thinking about the same material again, I tend to cut one.
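One rough way to apply this rule to a pipeline definition is to flag any step where the same model looks at the same material a second time. The sketch below is a heuristic with hypothetical names (`Step`, `redundant_steps`); the model labels are informal and the pipeline is an invented example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    model: str     # informal label of the model running the step
    material: str  # what the step examines, e.g. "spec" or "patch"


def redundant_steps(pipeline: list[Step]) -> list[str]:
    """Names of steps that repeat an earlier (model, material) pair."""
    seen: set[tuple[str, str]] = set()
    redundant: list[str] = []
    for step in pipeline:
        key = (step.model, step.material)
        if key in seen:
            redundant.append(step.name)
        seen.add(key)
    return redundant


pipeline = [
    Step("challenge spec", "codex", "spec"),
    Step("review patch", "claude-sonnet", "patch"),
    Step("re-review patch", "claude-sonnet", "patch"),
]
print(redundant_steps(pipeline))  # prints: ['re-review patch']
```

The check is intentionally crude. A second pass by the same model can still be worth keeping if the material changed in between, so treat a flagged step as a candidate to cut, not an automatic deletion.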

The same applies to multi-agent parallelism. DeepMind’s orchestration scaling study is useful here: tasks that can be split in parallel can gain 80.9%, while serial-reasoning tasks can drop 39% to 70%. Communication cost also grows faster than linearly with the number of agents, with a reported scaling exponent of around 1.7. Plainly, once you add more agents, they often stop doing work and start waiting for each other, then spend time explaining and verifying.

Do not run too many agents. Three to four is roughly the upper bound. Beyond that, token consumption easily grows to about 3.5x that of a single-agent flow. Parallelism is worth it only when modules can genuinely be split independently. Otherwise, a short flow with high independence is usually more stable.
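As a planning rule, this reduces to two inputs: how many parts of the task are genuinely independent, and a hard cap on agents. The sketch below encodes only that heuristic. The function name is hypothetical and the default cap of 4 comes from the rule of thumb above, not from a measurement.

```python
def plan_agent_count(independent_modules: int, max_agents: int = 4) -> int:
    """Return how many agents to run for a task.

    independent_modules: number of parts that can be worked on without
    waiting on each other (1 means the task is effectively serial).
    """
    if independent_modules < 1:
        raise ValueError("a task has at least one module")
    if max_agents < 1:
        raise ValueError("max_agents must be at least 1")
    return min(independent_modules, max_agents)


print(plan_agent_count(1))   # prints: 1
print(plan_agent_count(10))  # prints: 4
```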

Models will keep improving, and the specific numbers in 2024-2026 research will probably be rewritten. Blind-spot rates will fall, review tools will improve, and spec generation will likely become much better.

Several lower-level principles should hold for longer: review needs an independent view, spec quality sets the ceiling, and after a certain point iteration is usually empty motion. Today’s self-review blind-spot rate is still around 64.5%. As models improve, that number should fall. How low it must get before self-review deserves to return to the main path remains an open question.

Further Reading

GitHub: https://github.com/p3nchan/orchestration-playbook

Penchan’s Take

The real OpenClaw setup runs Opus, Sonnet, and ChatGPT as three agents, and this Challenge → cross-family review → spec-driven workflow grew out of that. The biggest lesson: send an Opus-written spec to Codex to pick holes, then let Sonnet review the implementation. Blind spots are found much more often by another model family than by same-family self-review. A lean spec, evidence-backed challenges, and a switch of reviewer family: all three need to happen together for the flow to become stable.

FAQ

Q: Why is AI self-review unreliable?

Research has found a 64.5% blind-spot rate when AI models review their own output, and models can be about 5 times more forgiving of their own work. That means when the same model writes and reviews, most issues are invisible to it. The practical fix is to review with a different model family.

Q: What is a Challenge Loop?

A Challenge Loop is an adversarial review process: one agent produces work, another agent tries to find problems, but every challenge must include testable evidence. Criticism without evidence does not count. Research and practice both show this catches more real problems than generic review.

Q: Why is spec quality more important than adding more review rounds?

Research shows that refining the spec can raise code correctness (Pass@1) by 13.87% in relative terms, better than simply adding another review round. If the spec is vague, the reviewer and coder share the same misunderstanding, and extra reviews still will not find it.


— Penchan