I rebuilt the core algorithm of an automated decision system from end to end. The whole thing took about a month.

The old way looked a lot like throwing darts: hand the requirement to a model, wait for a pile of files, skim them, run it, and push forward if nothing explodes. If something really breaks, come back and patch it. The result is endless cleanup. Many bugs are not hard to fix; the expensive part is spending two weeks coding in a direction that should never have existed.

This time the team split was explicit: the human judged direction, wrote specs, and made tradeoffs; the fast model focused on code; the strong reasoning model challenged the design; another model family did review only; and an external model (Web AI) handled deep research. The whole process was slower and stretched across a month, but it clearly shifted from “using AI to type faster” to “leading a small development team.”

AI team workflow overview

Phase 0: Hypotheses First, Code Later

There was already a system running, and it looked decent on the surface. That makes it very easy to be fooled by the current state. When you look at a system every day, many assumptions quietly start feeling like common sense.

The first step was to stop and extract those assumptions one by one. I turned them into four question packs and sent them to four external models for deep research. This took less than a day. The reports came back with very different styles: some looked like research assistants, some like picky reviewers, and some proactively added counterexamples. But they hit the same hole: the design I trusted most had structural risk that could not be fixed by tweaking small parameters.

When the direction itself is uncertain, deep research before coding is much cheaper than starting implementation immediately. Writing code feels productive; research does not. The problem is that feeling productive is very good at lying.

One day of external research can block three weeks of polished work in the wrong direction.

Four-model hypothesis validation

There is a clear preference here: I would rather read four independent reports on the same question than one beautifully written single conclusion. Cross-check the claims; do not get convinced by a smooth narrative. This is exactly where external models are useful. They have not sunk human time into the original design, so they are less eager to help it save face.

Phase 1: Spec + Challenge Loop

Only after the direction flipped did I come back and write the v1 spec.

This spec did not optimize for prose. It was rigid by design. Goals, acceptance criteria, data integrity, and boundary conditions were all nailed down first. If this part is vague, every later model fills in the blanks differently, and you still end up cleaning up the mess.

The thing that made the biggest difference was the challenge prompt. I stopped asking models to “review this for me”; that prompt is too gentle. I changed it to directly ask for the places where this spec could fail, preferably with verifiable scenarios and counterexamples. The first challenge round immediately showed that this run had a chance to become stable. It found more than a dozen issues, including several deep ones, and even surfaced missing state-machine transitions and places where the system could deadlock.

Those spec-fixing rounds felt like walking with someone shining a flashlight on the path. v1 was still filling big holes, v2 tightened the conditions, and after v3 the problems gradually moved from missing chunks of logic to loosely bound edge cases. It took roughly five to six rounds before the challenge count stabilized. One extra hour spent on the spec often saved two hours of debugging later.

One warning: too few challenges does not necessarily mean the spec is good. Sometimes the prompt is too weak, or the reviewer is too polite. Models try to please users, and that is dangerous in development. Ask a loose question and it will go along with you, like a very polite colleague who is not actually helping.

Challenge Loop convergence diagram

Phase 2-3: Implementation + Cross-Family Review

Only after the spec converged enough did implementation get dispatched.

The work was split into several tasks. I drew the dependency graph first, then decided what could run in parallel. This step protects your own context. If the boundaries are not clean, the time saved up front gets paid back during merge.

Across the full run, quite a few files changed. The new logic was not the hardest part. The hardest part was how it touched old paths. That is also why I skipped self-review. The model that just connected the new thing usually keeps its attention glued to that line and easily misses adjacent modules that were affected. Bringing in another model family for review really does change the error distribution and perspective.

This paid off quickly. The first cross-family review caught several critical issues immediately, and none of them were inside the new feature itself; they were in old paths. Without that layer, the whole segment would have moved forward with fake confidence that “everything seems to pass,” only for the validation phase to slowly reveal the corners falling off.

Parallel development also hit a very typical pitfall. Two tasks touched adjacent functions at the same time. Each was logically correct alone, but together they overwrote each other. That merge did not explode badly, but it was enough of a reminder: parallel dispatch cannot rely on instinct. Which files will be touched, and which interfaces are shared, need to be mapped before the work starts.

Phase 4-5: Validation + Monitoring

Writing the code was not the finish line.

After full validation, the core logic turned out to be mostly fine. The real drag was another classification component. Its false-positive rate was high, so it filtered out scenarios that should have been kept, making the main logic look like it still needed more fixes. Without complete validation, I might have spent another week circling the wrong part.

This is why the canary layer matters: launch small, set monitoring metrics first, write kill criteria first, and automatically degrade if the threshold breaks. Put the process in place before pressure arrives, and people are less likely to thrash. System states were deliberately split into three buckets: Active, Bench, and Retired, so “take it down and observe” becomes a normal action.

Looking Back, the Time Went Somewhere Else

If I force the month into rough percentages: hypothesis validation 15%, Spec + Challenge 25%, implementation 20%, Review + Fix 15%, validation + deployment 25%.

That distribution is very different from before. Previously, about 70% of the time was stuck in code and debug, which makes you believe engineering is supposed to feel that exhausting. Now code review itself is only about 10%, which feels more reasonable. The real time moved into deciding whether this should be done, how it should be done, and how to prove it actually works after it is done.

The best investment was still Phase 0 external deep research. One day blocked a wrong direction outright. The most surprising gain was the dozen-plus issues from the first challenge round. If those had reached production, the repair cost could easily have been ten times higher.

This method is not light. For simple scripts and low-risk patches, there is no reason to stretch the process this far. But once the task touches real data, real cost, and real risk, slower is often cheaper.

One open line I am still testing: Phase 0 used only 15% of the time and already blocked one wrong direction, so in theory it deserves more investment. But once research becomes addictive, it is easy to slide into analysis paralysis. I still do not have a fully satisfying answer for where to stop.

Further Reading

GitHub: https://github.com/p3nchan/orchestration-playbook

This article is for research and discussion only. Not investment advice. DYOR + NFA.


Penchan’s Take

The Opus / Sonnet / Codex three-agent setup on OpenClaw really did run this way. Two things stood out most: one day of external DR in Phase 0 directly saved a week of implementation, and after changing the challenge prompt from “review this for me” to “find where this spec can fail and attach counterexamples,” the first round found more than a dozen valid issues. One extra hour on the spec usually saves two hours of debugging later. That pattern has been stable in my own workflow. Recommended.

FAQ

Q: Can one person really develop algorithms with an AI team?

Yes, but only with clear roles and process. In practice: the human owns strategy and specs, a fast model writes code, a strong reasoning model challenges the design, another model family reviews blind spots, and an external model handles deep research. The key is making every role do only what it is good at.

Q: What does the Challenge Loop look like in real AI development?

One algorithm spec usually goes through 4-6 challenge rounds. The first round often finds several real problems; later rounds gradually converge. Every challenge must include testable evidence, otherwise the loop turns into empty motion.

Q: What role does external Deep Research play in the development flow?

External DR is most useful at two points: validating the hypothesis before development starts, and searching for blind spots after the design has converged. Ordinary code development usually does not need DR.


— Penchan