When 4 AI Agents Debated How to Manage Their Own Memory

AI Agents are not just chatbots. When they run on a computer for a long time, manage projects, execute schedules, and produce content, they need a file system to store memory and work records.

This sounds easy: create a few folders, write a few documents, define naming rules, and it should be done.

In practice, this may be one of the most underestimated engineering challenges in an AI Agent system.

When Rules Cannot Govern Themselves

After an AI assistant system had been running for more than two months, I opened it up for inspection. This is what I found.

The MEMORY.md memory index had a rule: “keep it under 1KB.” Its actual size was 7KB, seven times the budget. The problem was not that nobody noticed. The system generated new knowledge every day and had only one place to put it.

Drafts for X platform content were scattered across four different paths. Two nearly identical folders, reference and references, held unrelated files. Within a dozen days, the system had automatically produced nearly 300 memory files.
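Near-duplicate folders such as reference and references are easy to catch mechanically. The sketch below is one way to do it, not the inspection method actually used here: the function names and the 0.9 similarity threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path


def similar_names(names, threshold=0.9):
    """Return pairs of names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(names), 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


def near_duplicate_dirs(root, threshold=0.9):
    """Find sibling directories under `root` with nearly identical names."""
    root = Path(root)
    parents = [root, *(p for p in root.rglob("*") if p.is_dir())]
    findings = []
    for parent in parents:
        children = [c.name for c in parent.iterdir() if c.is_dir()]
        for a, b in similar_names(children, threshold):
            findings.append((parent, a, b))
    return findings
```

A report like this only surfaces the symptom. Deciding which of the two folders should survive is still a judgment call.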

These problems had one shared trait: the system could not even follow the simplest rules it had written for itself.

The instinctive response is to add more rules and stricter standards. But this is where it is worth pausing: if the current rules have already failed, will adding new rules make things better?

Letting AI Debate What to Do

To avoid getting stuck in one person’s blind spots, I designed an experiment: let four AI Agents each take a different position, read three research reports (ChatGPT Deep Research, Perplexity search analysis, and an academic report), then propose solutions from their own angles.

The four roles were:

The pragmatist believed “do not fix what is not broken.” He compared the existing system with the research suggestions one by one and found that layered boot, on-demand reading, and memory compression were already in place. His stance: do not rush into a big rewrite just because the research report looks impressive.

The architect believed “the interest on structural debt is higher than financial debt.” He used current growth data to forecast the system one year later. At the growth rate of the time, the memory directory would exceed 5,000 files within a year. His view: working now does not mean working three months from now.

The failure analyst ignored theory and looked only at evidence. He crawled through the whole file system and listed every misplaced file and every violated rule. His job was to show how each proposal would fail; proposing the solution was someone else’s job.

The migration strategist did not take sides. He cared about one thing: if we decide to change it, how do we change it without breaking the current system? He designed rollback paths for each change.
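The architect's one-year forecast can be sanity-checked with a straight-line projection. The sketch below assumes that “a dozen days” means 12 days, that “nearly 300” means 300 files, and that growth stays linear. The architect's actual data and method are not given here, so treat this as a rough check and not a reconstruction of his calculation.

```python
def files_per_day(file_count: int, days: int) -> float:
    """Average number of new files per day over the observed window."""
    return file_count / days


def projected_files(file_count: int, days: int, horizon_days: int = 365) -> int:
    """Linear projection of file count over `horizon_days` (assumes a constant rate)."""
    return round(files_per_day(file_count, days) * horizon_days)


# Assumed inputs: 300 files observed over 12 days.
rate = files_per_day(300, 12)        # 25.0 files per day
one_year = projected_files(300, 12)  # 9125 files after 365 days
```

Under those assumptions the projection lands well above the 5,000-file mark, so the architect's threshold looks conservative even if the real growth rate was lower.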

The Core Conflict: Add Structure or Reduce Rules?

When the four roles laid out their analysis, one core contradiction quickly appeared.

The failure analyst was clear: the system already could not follow its existing rules. Adding more structure would only increase load. His prediction: if every file were forced to add a YAML header, compliance would drop below 50% within three weeks. Since AI Agents constantly generate new files, nobody remembers to add headers every time.

The architect’s counterargument also held up: MEMORY.md bloated to seven times its limit because structurally there was nowhere else for that knowledge to go. Discipline was only the symptom. Information had only two destinations: the index file, which would bloat, or daily notes, which would disappear during archive.

The two views looked contradictory, but on closer inspection, they were saying the same thing.

MEMORY.md bloated because writing into MEMORY.md was easier than creating a new topic file. Drafts scattered because there was no clear “correct place” that naturally led humans, or AI, to put things where they belonged.

“Architecture vs discipline” is not the real either-or. If the architecture makes correct behavior the path of least resistance, discipline happens naturally.

That conclusion sounds simple, but it changed every later decision criterion. I stopped asking “is this rule strict enough?” and started asking “does this design make the right behavior easier?”

What We Changed

The final proposal focused on three things.

First, split the single memory index into an index plus topic files. The original 7KB MEMORY.md became a pure index under 500 bytes, plus a dozen topic-specific standalone files. On boot, the AI only loads the index; when needed, it reads the matching topic. This one change cut a meaningful share of boot-token cost.
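As a minimal sketch of the index-plus-topics split: the index line format (`- topic: relative/path.md`), the `topics/` folder, and the `Memory` class are all hypothetical, chosen only to show the idea that boot reads the small index and topic files are read on demand.

```python
from pathlib import Path


def parse_index(index_text: str) -> dict[str, str]:
    """Parse lines like '- topic: topics/file.md' into {topic: relative_path}."""
    topics = {}
    for line in index_text.splitlines():
        line = line.strip()
        if line.startswith("- ") and ":" in line:
            name, _, rel_path = line[2:].partition(":")
            topics[name.strip()] = rel_path.strip()
    return topics


class Memory:
    """Loads only the index at boot; topic files are read when asked for."""

    def __init__(self, root, index_name: str = "MEMORY.md"):
        self.root = Path(root)
        index_text = (self.root / index_name).read_text(encoding="utf-8")
        self.topics = parse_index(index_text)

    def read_topic(self, name: str) -> str:
        """Read one topic file on demand; raises KeyError if it is not indexed."""
        return (self.root / self.topics[name]).read_text(encoding="utf-8")
```

The point of the shape is that adding knowledge means adding one topic file and one index line, which keeps the index itself small.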

Second, change the boot process from natural-language description into a strict checklist. Directly write “Step 1: read this file. Step 2: read that file.” Phrases like “please follow the principles below” are too loose. The more explicit it is, the less the AI skips or improvises.
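One way to keep such a checklist honest is to generate it from an ordered list and verify that every file it names exists. This is a sketch under assumptions: apart from MEMORY.md, the file names below are placeholders, and the helper names are hypothetical.

```python
from pathlib import Path

# Hypothetical boot order. Only MEMORY.md comes from the article;
# the other names are placeholders for illustration.
BOOT_STEPS = ["MEMORY.md", "RULES.md", "TODAY.md"]


def render_checklist(steps):
    """Turn an ordered list of files into explicit numbered instructions."""
    return "\n".join(
        f"Step {number}: read {name}." for number, name in enumerate(steps, start=1)
    )


def missing_steps(root, steps):
    """Return checklist files that do not exist under `root`."""
    return [name for name in steps if not (Path(root) / name).is_file()]
```

Calling `render_checklist(BOOT_STEPS)` yields three lines beginning with "Step 1: read MEMORY.md.", and a non-empty result from `missing_steps` is a signal to fix the checklist before the Agent boots with a broken instruction.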

Third, build a lightweight directory index that is automatically rebuilt every week in Markdown. No JSON, no database. If it goes stale, the system does not break; it just has one less reference.
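A directory index of this kind can be rebuilt with a few lines of code. The sketch below is an assumption about what such a script might look like: the output file name INDEX.md, the line format, and the choice to skip hidden folders are all illustrative, and the weekly scheduling (cron or similar) is left out.

```python
from pathlib import Path


def build_index(root) -> str:
    """Render a Markdown list of every non-hidden directory under `root`."""
    root = Path(root)
    lines = ["# Directory index", ""]
    directories = sorted(p for p in root.rglob("*") if p.is_dir())
    for directory in directories:
        relative = directory.relative_to(root)
        # Skip hidden folders such as .git and anything inside them.
        if any(part.startswith(".") for part in relative.parts):
            continue
        count = sum(1 for entry in directory.iterdir() if entry.is_file())
        lines.append(f"- {relative.as_posix()}/ ({count} files)")
    return "\n".join(lines) + "\n"


def rebuild_index(root, index_name: str = "INDEX.md") -> Path:
    """Overwrite the index file with a fresh listing and return its path."""
    target = Path(root) / index_name
    target.write_text(build_index(root), encoding="utf-8")
    return target
```

Because the file is regenerated from scratch each time, a stale copy is never a source of truth. It is only a slightly outdated hint.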

What We Deliberately Did Not Do

This may have been the most valuable output of the debate.

Not using Johnny Decimal folder numbering. Renaming every folder with numbers, such as 11.02 instead of project_a, sounds organized. In practice, it would mean renaming dozens of project directories and breaking every hard-coded path in code. Semantic names also mean more to AI than numbers. project_a is easier to understand than 11.02.

Not using Zettelkasten atomic notes. Academia loves this knowledge-management method. But the system only had dozens of notes. Using an academic-grade knowledge-management system at that scale is like using an ERP system to manage household expenses.

Not using cryptographic hash drift detection. Using cryptographic methods to detect accidental document changes makes sense in some systems. For a one-person, one-computer setup, git diff is enough.

Not changing state files to JSON. JSON is more structured, but human readability disappears. The need to quickly scan status on a phone is more common than automated parsing.

All these proposals had logic, but at the current scale and usage pattern, they solved problems that had not happened yet. The failure analyst gave a useful decision frame: design for 2x scale first, then reassess when you reach 2x. Do not engineer for imaginary 100x scale.

Five Easy Traps

During the debate, the failure analyst pointed out five hidden assumptions that often appear when designing AI Agent systems. They sound reasonable, but if you do not notice them, they easily lead to overengineering.

“AI Agents work like databases.” They do not. Agents are language models. They read text, reason, and output text. Designing key-value-query structures for them may point in the wrong direction.

“Not finding files is the main bottleneck.” In reality, the bottleneck is often “knowing that it should go look.” No matter how tidy the folders are, if the Agent does not know a file exists, tidiness does not help.

“More structure means more reliability.” Structure creates coupling. Coupling creates fragility. Every extra abstraction is another joint that can break.

“The system will massively scale later.” Maybe. But designing today’s architecture for future problems often means paying today’s cost for problems that do not exist yet.

“The Agent will follow new rules.” This is the most dangerous assumption. Every new line of rules dilutes attention from all existing rules. The more rules there are, the lower the chance each one is followed.

What Happened Later

Looking back three months later, a few things are worth recording.

The pragmatist’s judgment was basically right. The changes that truly helped were the first three: memory splitting, boot checklist, and naming standards. Once those were done, the system’s messiness improved visibly.

The architect’s four-layer memory model has only reached two and a half layers so far. The design is fine; the need has not grown that far yet.

The failure analyst’s prediction was the sharpest. His argument that “mandatory YAML headers will drop below 50% compliance in three weeks” was so convincing that the header requirement was never implemented. His worry about a stale directory index did come true, but weekly auto-rebuilds were enough to contain it.

The migration strategist’s rollback framework was the most useful in practice. I tried two proposals, decided against them, and rolled both back. The process caused no damage. Being able to try and abandon something safely is itself good system design.
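The internals of the rollback framework are not described here. As a minimal sketch of the idea, assuming a plain directory copy stands in for whatever mechanism was actually used (a git commit before each change would serve the same purpose), a snapshot-and-restore pair might look like this. The function names are hypothetical.

```python
import shutil
from pathlib import Path


def snapshot(target, backup) -> Path:
    """Copy `target` to `backup` before a risky change; refuse to overwrite."""
    target, backup = Path(target), Path(backup)
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.copytree(target, backup)
    return backup


def rollback(target, backup) -> None:
    """Discard the changed `target` and restore it from `backup`."""
    target, backup = Path(target), Path(backup)
    if not backup.is_dir():
        raise FileNotFoundError(f"no backup to restore: {backup}")
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(backup, target)
```

The order matters: take the snapshot first, apply the experiment, and only delete the backup once the change has proven itself.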

Take One Thing Away

If you are building your own AI Agent system, or wondering why your automation workflows keep getting messier, the whole debate compresses into one sentence:

Do not rush to add rules. First look at why the existing rules were not followed.

The answer is usually that the structure did not make correct behavior the easiest choice. “The AI is not obedient enough” is only the surface symptom.

Make the right thing the easiest thing, and the rest starts to happen naturally.

Penchan’s Experience

I have run OpenClaw’s Opus / Sonnet / Codex three-agent setup for a while. All memory uses Markdown files, with no RAG and no vector database. The two lessons from this debate keep proving themselves in practice. First, if structure can solve it, do not rely on rules. For example, put each project’s context in that project’s folder, and it will naturally be read without forcing the AI to remember to check. Second, good enough is enough. The four-layer memory design has only used two and a half layers so far; the rest can wait until the need is real. The most useful tool was actually the rollback framework: try a complex architecture on a small scale first. A design you can safely abandon is a good design.

FAQ

Q: Why do AI Agent memory systems get out of control?

AI Agents automatically generate many files every day: memory notes, work logs, drafts. If the structure only gives them one or two places to store things, files either bloat in one place or scatter everywhere. The core problem is usually that the architecture does not make correct behavior the path of least resistance.

Q: What is a multi-agent debate?

It means asking multiple AI Agents to take different positions, such as pragmatist, architect, failure analyst, and migration strategist, then analyze the same material and propose solutions. The collision of viewpoints reveals blind spots a single perspective can miss.

Q: What is the most common mistake when designing AI Agent systems?

The most common mistake is thinking “more rules will fix the mess.” In reality, every new rule dilutes attention from existing rules. The better move is to improve the architecture so the correct behavior becomes the easiest behavior.

— Penchan