I Built a Memory System for My AI Agent: From ACE Paper to Production-Grade Loop

My AI coding agent had amnesia. Every new session started from a blank slate: it forgot completed tasks, relearned the same pitfalls, and never updated its own skill library. Over a weekend, I built a complete context engineering system based on two top-tier papers. Result: an 84% reduction in total skill description length, 16 cron jobs running autonomously, and 5 draft skills auto-generated.

The biggest bottleneck for AI agents isn't reasoning. It's memory. In my setup, 76% of skills were never used. Fixing this doesn't require the next frontier model. It requires architecture.

Quantifying the problem

Before building this system, I audited my Hermes Agent:

Metric	Number	What it means
Total skills	78	—
Never used	59 (76%)	Agent had no idea they existed
Descriptions >100 chars	39 (57%)	Bloating prompt tokens
MEMORY.md size	6,765 chars	600+ lines of duplicated dates (bug)
Curator runs	5	0 stale, 0 archived (never cleaned anything)
Solution files	43	All dumped in one directory, unsearchable

In one sentence: the Agent was accumulating knowledge, but that knowledge was never organized, retrieved, or written back.

This isn't a model capability problem. It's an architecture problem.

The research foundation

I spent half a day reading two 2026 papers:

ACE (Agentic Context Engineering) — ICLR 2026, arXiv 2510.04618

Core finding: context is not a static summary—it's an evolving playbook. Three roles form a closed loop:

Role	Responsibility	Paper Results
Generator	Execute tasks	—
Reflector	Reflect on outputs	—
Curator	Organize, deduplicate, decay, write back	+10.6% benchmark, -86.9% latency, -75.1% cost

Key insight: Bulletized context + incremental updates. Never rewrite the entire context; only append new structured entries. The paper documented a case where a prose context collapsed from 18,282 tokens to 122 tokens when it was rewritten wholesale, losing almost all information. That's why "write it shorter" isn't the solution. "Write it more structured" is.
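To make "bulletized context + incremental updates" concrete, here is a minimal sketch of how I think about it. The Bullet and Playbook classes, their field names, and the "bump a counter on repeat" rule are my own illustration, not the paper's reference implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Bullet:
    """One structured playbook entry (field names are illustrative)."""
    bullet_id: str
    section: str
    text: str
    helpful: int = 0


@dataclass
class Playbook:
    bullets: dict = field(default_factory=dict)

    def apply_delta(self, delta: list) -> None:
        """Merge new entries; existing bullets are never rewritten or dropped."""
        for b in delta:
            if b.bullet_id in self.bullets:
                # Same lesson seen again: bump a counter instead of duplicating text.
                self.bullets[b.bullet_id].helpful += 1
            else:
                self.bullets[b.bullet_id] = b

    def render(self) -> str:
        """Render the context as bullets grouped by section."""
        lines = []
        for section in sorted({b.section for b in self.bullets.values()}):
            lines.append(f"[{section}]")
            lines.extend(
                f"- ({b.bullet_id}) {b.text}"
                for b in self.bullets.values()
                if b.section == section
            )
        return "\n".join(lines)
```

Each session contributes a small delta, so a bad update can add or bump a few bullets but cannot erase the rest of the context.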

Memory for Autonomous LLM Agents — arXiv 2603.07670

A 58-page survey covering agent memory research from 2022 to 2026. Conclusion: the industry is converging on a three-tier architecture—episodic, semantic, and procedural memory. This aligns well with ACE's three-role model.

System design

Based on these two papers, I built a complete context engineering system. Core architecture:

Session End (Hook-triggered, not waiting for cron)
    ↓
reflect.sh --classify      → Auto-classify into patterns/bugs/knowledge
index.py                    → .index.yaml structured index
memory-hook.sh              → Threshold check → auto-update MEMORY.md
pattern-detector.sh         → 3+ same pattern → draft skill
skill-usage-hook.sh         → Snapshot diff tracking
    ↓
Cron safety net (if hook fails)
    5 cron jobs: decay / dedup / LLM consolidation / poison scan / disaster recovery

Design decision 1: Hook over Cron + Agent

The common industry approach: "scheduled cron + Agent decides." Problem: the Agent might "forget" to execute, and even when it does, updates land up to 24 hours after the session that produced them.

My approach: Hook-triggered primary, Cron as safety net.

Scenario	Hook (instant)	Cron (fallback)
New solution classification	✅ session-end	❌
MEMORY.md update	✅ session-end	✅ daily 4:00 AM
Pattern detection	✅ session-end	✅ weekly Sun 3:30 AM
Decay/dedup	❌ (too heavy)	✅ weekly early AM
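The hook-first, cron-as-fallback split can be sketched roughly like this. The marker file and the function names are assumptions for illustration; my actual hooks are shell scripts, and how the fallback detects a missed hook is simplified here.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def run_hook(steps: list, marker: Path) -> bool:
    """Run every session-end step; write the marker only if all succeed."""
    try:
        for step in steps:
            step()
    except Exception:
        # Leave the marker stale so the cron fallback picks the work up.
        return False
    marker.write_text(str(time.time()))
    return True


def cron_fallback(
    steps: list,
    marker: Path,
    max_age_s: float = 24 * 3600,
    now: Optional[float] = None,
) -> bool:
    """Re-run the steps only if the hook has not succeeded recently."""
    now = time.time() if now is None else now
    last_ok = float(marker.read_text()) if marker.exists() else 0.0
    if now - last_ok <= max_age_s:
        return False  # the hook is healthy; nothing to do
    return run_hook(steps, marker)
```

The point of the design is that neither path asks the Agent to decide anything: the hook runs on every session end, and the cron job only checks whether that happened.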

Design decision 2: Zero damage to Hermes core

All changes live under ~/.hermes/skills/ and ~/.hermes/scripts/. Not a single line of Hermes core code was modified. Delete the scripts + cron jobs to revert to vanilla Hermes.

Design decision 3: Full automation—reflect → classify → write back

Before: Agent hits a pitfall → writes solution file → next session doesn't know the file exists.

After:

Task complete → reflect.sh
    → Auto-classify into patterns/bugs/knowledge
    → index.py registers in .index.yaml (structured index)
    → pattern-detector finds 3+ same-pattern → auto-generates draft skill
    → 5+ high-importance → auto-writes back to MEMORY.md
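The two thresholds in the flow above boil down to simple counting. A minimal sketch, assuming each index entry is a dict with hypothetical "tag" and "importance" keys (the real .index.yaml schema may differ):

```python
from collections import Counter

DRAFT_SKILL_MIN = 3  # "3+ same-pattern" from the flow above
WRITEBACK_MIN = 5    # "5+ high-importance" from the flow above


def detect(entries: list) -> tuple:
    """Return (tags that qualify for a draft skill, whether to write back)."""
    tag_counts = Counter(e["tag"] for e in entries)
    draft_tags = sorted(t for t, n in tag_counts.items() if n >= DRAFT_SKILL_MIN)
    high = sum(1 for e in entries if e.get("importance") == "high")
    return draft_tags, high >= WRITEBACK_MIN
```

Because both decisions are plain threshold checks, they can run in a hook without any LLM call.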

Results

Skill system

Metric	Before	After
Description total chars	~15,000	2,322 (-84%)
Over-length descriptions	39 (57%)	0
Ghost entries (no actual skill)	70	0
Average description length	190 chars	34 chars

Memory system

Metric	Status
MEMORY.md chars	2,166 / 2,200 ✅
Solutions index	.index.yaml 362 lines
New cron jobs	5 (decay/dedup/merge/scan/recovery)
Pattern candidates	5 draft skills

Self-evolution

reflect.sh --classify → 33 indexed entries
    ↓
pattern-detector.sh → 3 high-importance patterns (17/16/6 occurrences)
    ↓
.candidates/ → 5 draft SKILL.md files
    ↓
Awaiting curator review → auto-publish

Lessons learned (the hard way)

This system wasn't built in one shot. Here are the real pitfalls:

Pitfall 1: `str.replace` full-text replacement

memory-sync.py used content.replace("Last updated:", f"Last updated:{today}") to update timestamps. But this pattern matches only the label, not the date that follows it, so each run inserted another date in front of the existing one instead of replacing it. After weeks, MEMORY.md hit 6,765 characters of almost nothing but duplicated dates.

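Fix: Use re.sub(r"^> Last updated: [\d-]+\n", ...) so the match covers the whole line, old date included. Pass flags=re.MULTILINE if the header is not the first line of the file, otherwise ^ only matches at the very start.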
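Here is a small reproduction of the bug next to the fix. The function names are illustrative, and the "> Last updated: " header format is taken from the regex above.

```python
import re

# Match the entire header line, including the old date.
HEADER = re.compile(r"^> Last updated: [\d-]+\n", flags=re.MULTILINE)


def buggy_update(content: str, today: str) -> str:
    # Matches only the label, so the old date stays and a new one is
    # inserted in front of it on every run.
    return content.replace("Last updated:", f"Last updated:{today}")


def fixed_update(content: str, today: str) -> str:
    # Replaces the whole header line once, so the file size stays stable.
    return HEADER.sub(f"> Last updated: {today}\n", content, count=1)
```

Running buggy_update repeatedly grows the header on every call, while fixed_update is idempotent in length.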

Pitfall 2: access_count never increments

The entire decay system depended on access_count to determine whether files were used. But nothing ever incremented this counter. Result: every file was marked "never used," archived at 30 days, and deleted at 90 days.

Fix: Switch to file mtime (modification time)—no counter needed.
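A minimal sketch of mtime-based decay, reusing the 30-day and 90-day thresholds mentioned above. The function name and the "keep" / "archive" / "delete" labels are assumptions for illustration.

```python
import time
from pathlib import Path
from typing import Optional

ARCHIVE_AFTER_DAYS = 30  # thresholds quoted in the pitfall above
DELETE_AFTER_DAYS = 90


def decay_action(path: Path, now: Optional[float] = None) -> str:
    """Classify a file as 'keep', 'archive', or 'delete' from its mtime."""
    now = time.time() if now is None else now
    age_days = (now - path.stat().st_mtime) / 86400
    if age_days >= DELETE_AFTER_DAYS:
        return "delete"
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep"
```

One caveat worth keeping in mind: mtime changes only when a file is written, so a file that is read often but never edited will still age out under this rule.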

Pitfall 3: yq not installed

Planned to use yq for YAML operations. Server didn't have it. PyYAML was available.

Lesson: Check installed tools before designing.

Pitfall 4: Source paths missing project context

Cross-project pattern detection needs to know "which projects did this pattern appear in." But the source field was session://date with no project path.

Fix: Inject project=$PWD during classification.
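In Python terms, the fix amounts to stamping each entry with the working directory at classification time. This is a sketch with a hypothetical entry layout; the helper names and the "project" key are assumptions.

```python
import os
from collections import defaultdict
from typing import Optional


def make_entry(tag: str, date: str, project: Optional[str] = None) -> dict:
    """Build an index entry; the project defaults to the current directory."""
    return {
        "tag": tag,
        "source": f"session://{date}",
        "project": project or os.getcwd(),  # equivalent of project=$PWD
    }


def cross_project_tags(entries: list, min_projects: int = 2) -> list:
    """Return tags that were seen in at least min_projects distinct projects."""
    projects = defaultdict(set)
    for e in entries:
        projects[e["tag"]].add(e["project"])
    return sorted(t for t, p in projects.items() if len(p) >= min_projects)
```

With the project recorded per entry, "which projects did this pattern appear in" becomes a set lookup instead of guesswork from dates.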

Pitfall 5: Agent decision-making is unreliable

The original design was "cron-triggered Agent reads JSON → decides whether to update MEMORY.md." But the Agent might not read it, or might read and not act.

Fix: Switch to Hook auto-execution—zero Agent decisions required.

Industry comparison

Capability	Claude Code Auto-Memory	Codex Chronicle	Ours
Agent self-writes context	✅ CLAUDE.md	✅ chronicle	✅ AGENTS.md + MEMORY.md writeback
Progressive loading	✅ description only	✅ 2% context cap	⚠️ Hermes architecture constraint
File decay	❌	❌	✅ mtime + trust decay
Pattern → skill generation	❌	❌	✅ 3+ same-tag → draft
Memory poisoning defense	❌	❌	✅ MINJA pattern scanning
Zero-damage design	❌ (coupled to core)	❌	✅ No core code modified

Our unique advantages: ACE three-role closed loop (the only complete implementation among the three compared here), Hook-first architecture, self-evolving skill generation.

Who this is for

If you:

Maintain 3+ AI Agent projects
Have 50+ skills but no idea which ones are actually used
Suspect your Agent's memory files are silently ballooning
Want your Agent to learn new skills from its own pitfall records

This system can be adapted directly. All scripts live under ~/.hermes/skills/ and ~/.hermes/scripts/, with zero core Agent modifications.

Next steps

The current system runs the full Reflector→Curator→Evolver loop. Coming next:

Embedding-based semantic dedup — currently using content hash, which misses near-duplicates
Cross-project AGENTS.md auto-update — when patterns appear across 2+ projects
LoCoMo benchmark evaluation — quantify memory improvement with standard benchmarks
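On the first item, a short sketch shows why content-hash dedup misses near-duplicates. SHA-256 and the function names are assumptions here; any exact hash behaves the same way.

```python
import hashlib


def content_key(text: str) -> str:
    """Exact-content fingerprint: any byte difference yields a new key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedup(texts: list) -> list:
    """Keep the first occurrence of each exact text, preserving order."""
    seen = set()
    kept = []
    for t in texts:
        key = content_key(t)
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept
```

Two solution notes that differ by a trailing space or one reworded sentence both survive this filter, which is the gap embedding-based similarity is meant to close.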

This system is based on ACE (arXiv 2510.04618) and Memory for Autonomous LLM Agents (arXiv 2603.07670). All code at ~/context-engineering-system/, zero Hermes core modifications.