I Built a Memory System for My AI Agent: From ACE Paper to Production-Grade Loop
My AI coding agent had amnesia. Every new session was a blank slate: it forgot completed tasks, relearned the same pitfalls, and never updated its own skill library. Over a weekend, I built a complete context engineering system based on two top-tier papers. Result: an 84% reduction in total skill-description length, 16 cron jobs running autonomously, and 5 draft skills auto-generated.

The biggest bottleneck for AI agents isn't reasoning. It's memory. Every new session is a blank slate. 76% of my skills were never used. Fixing this doesn't require the next frontier model. It requires architecture.
Quantifying the problem
Before building this system, I audited my Hermes Agent:
| Metric | Number | What it means |
|---|---|---|
| Total skills | 78 | — |
| Never used | 59 (76%) | Agent had no idea they existed |
| Descriptions >100 chars | 39 (57%) | Bloating prompt tokens |
| MEMORY.md size | 6,765 chars | 600+ lines of duplicated dates (bug) |
| Curator runs | 5 | 0 stale, 0 archived (never cleaned anything) |
| Solution files | 43 | All dumped in one directory, unsearchable |
In one sentence: the Agent was accumulating knowledge, but that knowledge was never organized, retrieved, or written back.
This isn't a model capability problem. It's an architecture problem.
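The usage and description-length numbers in the audit table can be gathered with a few lines of code. Here is a minimal sketch, assuming each skill is available as a record with `description` and `use_count` fields. Those field names and the `audit_skills` helper are hypothetical; Hermes may store this data differently.

```python
def audit_skills(skills: list[dict], max_desc_chars: int = 100) -> dict:
    """Summarize usage and description length for a list of skill records."""
    total = len(skills)
    never_used = sum(1 for s in skills if s.get("use_count", 0) == 0)
    too_long = sum(
        1 for s in skills if len(s.get("description", "")) > max_desc_chars
    )

    def pct(n: int) -> int:
        # Guard against an empty skill list.
        return round(100 * n / total) if total else 0

    return {
        "total": total,
        "never_used": never_used,
        "never_used_pct": pct(never_used),
        "too_long": too_long,
        "too_long_pct": pct(too_long),
    }
```

Running this before and after a cleanup gives comparable numbers for both states.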
The research foundation
I spent half a day reading two 2026 papers:
ACE (Agentic Context Engineering) — ICLR 2026, arXiv 2510.04618
Core finding: context is not a static summary—it's an evolving playbook. Three roles form a closed loop:
| Role | Responsibility | Paper Results |
|---|---|---|
| Generator | Execute tasks | — |
| Reflector | Reflect on outputs | — |
| Curator | Organize, deduplicate, decay, write back | +10.6% benchmark, -86.9% latency, -75.1% cost |
Key insight: Bulletized context + incremental updates. Never rewrite the entire context—only append new structured entries. The paper documented a case where prose context went from 18,282 tokens to 122 tokens after compression, losing almost all information. That's why "write it shorter" isn't the solution—"write it more structured" is.
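To make "bulletized context + incremental updates" concrete, here is a minimal sketch of a playbook that only accepts deltas. The `Bullet` and `Playbook` classes and their field names are my own illustration, not the paper's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Bullet:
    """One structured playbook entry (field names are illustrative)."""
    bullet_id: str
    section: str      # e.g. "pitfalls" or "patterns"
    text: str
    helpful: int = 0  # how often the entry was confirmed again


@dataclass
class Playbook:
    bullets: dict = field(default_factory=dict)

    def apply_delta(self, delta: list) -> None:
        """Merge new entries without rewriting existing ones."""
        for item in delta:
            existing = self.bullets.get(item.bullet_id)
            if existing is None:
                self.bullets[item.bullet_id] = item  # append a new entry
            else:
                existing.helpful += 1  # known entry: only bump its counter

    def render(self) -> str:
        """Render the playbook as bullet lists grouped by section."""
        lines = []
        for section in sorted({b.section for b in self.bullets.values()}):
            lines.append(f"## {section}")
            lines += [
                f"- [{b.bullet_id}] {b.text} (helpful={b.helpful})"
                for b in self.bullets.values()
                if b.section == section
            ]
        return "\n".join(lines)
```

Because `apply_delta` never regenerates existing text, a bad update can add a weak bullet but cannot collapse the whole context into a short summary.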
Memory for Autonomous LLM Agents — arXiv 2603.07670
A 58-page survey covering agent memory research from 2022 to 2026. Conclusion: the industry is converging on a three-tier architecture—episodic, semantic, and procedural memory. This aligns well with ACE's three-role model.
System design
Based on these two papers, I built a complete context engineering system. Core architecture:
Session End (Hook-triggered, not waiting for cron)
↓
reflect.sh --classify → Auto-classify into patterns/bugs/knowledge
index.py → .index.yaml structured index
memory-hook.sh → Threshold check → auto-update MEMORY.md
pattern-detector.sh → 3+ same pattern → draft skill
skill-usage-hook.sh → Snapshot diff tracking
↓
Cron safety net (if hook fails)
5 cron jobs: decay / dedup / LLM consolidation / poison scan / disaster recovery
Design decision 1: Hook over Cron + Agent
The common industry approach is "scheduled cron + Agent decides." The problem: the Agent might "forget" to act, and even when it does act, updates can lag by up to 24 hours.
My approach: Hook-triggered primary, Cron as safety net.
| Scenario | Hook (instant) | Cron (fallback) |
|---|---|---|
| New solution classification | ✅ session-end | ❌ |
| MEMORY.md update | ✅ session-end | ✅ daily 4:00 AM |
| Pattern detection | ✅ session-end | ✅ weekly Sun 3:30 AM |
| Decay/dedup | ❌ (too heavy) | ✅ weekly early AM |
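The hook-first, cron-as-fallback split can be sketched as follows. This is an assumption-laden illustration, not the actual scripts: the marker file, the function names, and the 24-hour freshness window are all hypothetical.

```python
import subprocess
import time
from pathlib import Path


def run_steps(steps: list) -> list:
    """Run each command in order; return the program names that failed."""
    failed = []
    for cmd in steps:
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=300)
            ok = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            ok = False  # a missing script or a hung step counts as a failure
        if not ok:
            failed.append(cmd[0])
    return failed


def session_end_hook(steps: list, marker: Path) -> list:
    """Primary path: run at session end, stamp the marker only on full success."""
    failed = run_steps(steps)
    if not failed:
        marker.write_text(str(time.time()))
    return failed


def cron_fallback(steps: list, marker: Path, max_age_s: float = 86400.0) -> bool:
    """Safety net: rerun only if the hook has not succeeded within max_age_s."""
    if marker.exists() and time.time() - float(marker.read_text()) < max_age_s:
        return False  # the hook already did the work
    session_end_hook(steps, marker)
    return True
```

A call might pass `steps=[["reflect.sh", "--classify"], ["memory-hook.sh"]]`. The cron job then costs almost nothing on days when the hook succeeded.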
Design decision 2: Zero damage to Hermes core
All changes live under ~/.hermes/skills/ and ~/.hermes/scripts/. Not a single line of Hermes core code was modified. Delete the scripts + cron jobs to revert to vanilla Hermes.
Design decision 3: Full automation—reflect → classify → write back
Before: Agent hits a pitfall → writes solution file → next session doesn't know the file exists.
After:
Task complete → reflect.sh
→ Auto-classify into patterns/bugs/knowledge
→ index.py registers in .index.yaml (structured index)
→ pattern-detector finds 3+ same-pattern → auto-generates draft skill
→ 5+ high-importance → auto-writes back to MEMORY.md
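The classify, detect, and write-back steps above can be sketched in a few functions. The thresholds (3+ occurrences, 5+ high-importance entries) come from the flow above. The keyword rules, the entry fields `tags` and `importance`, and the function names are assumptions for illustration; the real `reflect.sh` rules are not shown in this post.

```python
from collections import Counter

# Hypothetical keyword rules, checked in order; anything else is "knowledge".
RULES = {
    "bugs": ("error", "traceback", "crash"),
    "patterns": ("always", "whenever", "every time"),
}


def classify(text: str) -> str:
    """Assign a solution note to patterns, bugs, or knowledge."""
    lowered = text.lower()
    for category, keywords in RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "knowledge"


def draft_candidates(entries: list, min_count: int = 3) -> list:
    """Return tags that occur in at least min_count indexed entries."""
    counts = Counter(tag for e in entries for tag in e.get("tags", []))
    return sorted(tag for tag, n in counts.items() if n >= min_count)


def should_write_back(entries: list, min_high: int = 5) -> bool:
    """True once enough high-importance entries have accumulated."""
    high = sum(1 for e in entries if e.get("importance") == "high")
    return high >= min_high
```

None of these functions asks the Agent to decide anything. Each is a plain threshold check that a hook can run unattended.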
Results
Skill system
| Metric | Before | After |
|---|---|---|
| Description total chars | ~15,000 | 2,322 (-84%) |
| Over-length descriptions | 39 (57%) | 0 |
| Ghost entries (no actual skill) | 70 | 0 |
| Average description length | 190 chars | 34 chars |
Memory system
| Metric | Status |
|---|---|
| MEMORY.md chars | 2,166 / 2,200 ✅ |
| Solutions index | .index.yaml 362 lines |
| New cron jobs | 5 (decay/dedup/merge/scan/recovery) |
| Pattern candidates | 5 draft skills |
Self-evolution
reflect.sh --classify → 33 indexed entries
↓
pattern-detector.sh → 3 high-importance patterns (17/16/6 occurrences)
↓
.candidates/ → 5 draft SKILL.md files
↓
Awaiting curator review → auto-publish
Lessons learned (the hard way)
This system wasn't built in one shot. Here are the real pitfalls:
Pitfall 1: str.replace full-text replacement
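memory-sync.py used content.replace("Last updated:", f"Last updated:{today}") to update timestamps. But the pattern matches only the label, not the date that follows it, so each run inserts another date in front of the existing ones. After weeks, MEMORY.md hit 6,765 chars of almost nothing but duplicated dates.
Fix: Use re.sub(r"^> Last updated: [\d-]+\n", ..., flags=re.MULTILINE) so the match covers the whole line, old date included. Unless the header is the very first line of the file, ^ needs re.MULTILINE to match at the start of a line.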
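Here is a small reproduction of the bug next to the fix. The function names and the sample MEMORY.md layout are assumptions for illustration; only the `replace` call and the regex come from the actual script.

```python
import re

# Matches the whole header line, including the old date.
# re.MULTILINE lets ^ match at the start of any line, not just the file.
HEADER = re.compile(r"^> Last updated: [\d-]+\n", flags=re.MULTILINE)


def buggy_update(content: str, today: str) -> str:
    """Matches only the label, so the old date survives after the new one."""
    return content.replace("Last updated:", f"Last updated:{today}")


def fixed_update(content: str, today: str) -> str:
    """Replaces the label and the old date together, so reruns are stable."""
    return HEADER.sub(f"> Last updated: {today}\n", content, count=1)


doc = "# Memory\n> Last updated: 2026-01-01\n- note\n"
once = buggy_update(doc, "2026-01-02")
twice = buggy_update(once, "2026-01-03")
print(len(doc), len(once), len(twice))       # the file grows on every run
print(fixed_update(doc, "2026-01-02") == "# Memory\n> Last updated: 2026-01-02\n- note\n")
```

The property worth testing is that running the update twice leaves the file the same length as running it once.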
Pitfall 2: access_count never increments
The entire decay system depended on access_count to determine if files were used. But nothing increments this counter. Result: all files marked "never used," archived at 30 days, deleted at 90 days.
Fix: Switch to file mtime (modification time)—no counter needed.
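A minimal sketch of mtime-based decay, assuming the same 30-day archive and 90-day delete thresholds mentioned above. The `decay_action` helper is hypothetical. Note that mtime records the last write, so a file that is read often but never edited will still age.

```python
import time
from pathlib import Path
from typing import Optional

ARCHIVE_AFTER_DAYS = 30  # assumed to match the thresholds described above
DELETE_AFTER_DAYS = 90
SECONDS_PER_DAY = 86400


def decay_action(path: Path, now: Optional[float] = None) -> str:
    """Return 'keep', 'archive', or 'delete' based on the file's mtime."""
    now = time.time() if now is None else now
    age_days = (now - path.stat().st_mtime) / SECONDS_PER_DAY
    if age_days >= DELETE_AFTER_DAYS:
        return "delete"
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep"
```

Returning an action name instead of moving files directly keeps the function easy to dry-run before anything is actually deleted.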
Pitfall 3: yq not installed
I planned to use yq for YAML operations, but the server didn't have it. PyYAML was available, so I used that instead.
Lesson: Check installed tools before designing.
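A preflight check along these lines would have caught the problem before any design work. This is a sketch; `missing_tools` is a hypothetical helper built on the standard-library calls `shutil.which` and `importlib.util.find_spec`.

```python
import importlib.util
import shutil


def missing_tools(commands: list, modules: list) -> list:
    """Return the CLI commands and top-level Python modules that are absent."""
    missing = [c for c in commands if shutil.which(c) is None]
    # find_spec returns None for a top-level module that cannot be imported.
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing


# PyYAML is imported under the module name "yaml".
print(missing_tools(["yq"], ["yaml"]))
```

An empty list means both options are available. Otherwise the output names exactly what to install or design around.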
Pitfall 4: Source paths missing project context
Cross-project pattern detection needs to know "which projects did this pattern appear in." But the source field was session://date with no project path.
Fix: Inject project=$PWD during classification.
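Here is a sketch of what a project-aware index entry might look like and how cross-project detection can use it. The entry fields, the ISO date format in `source`, and both function names are assumptions. The post only establishes that `source` was `session://date` and that the project path is now injected.

```python
import os
from collections import defaultdict
from datetime import date
from typing import Optional


def build_entry(title: str, tags: list, session_date: date,
                project_dir: Optional[str] = None) -> dict:
    """Build an index entry that records which project it came from."""
    return {
        "title": title,
        "tags": list(tags),
        "source": f"session://{session_date.isoformat()}",
        # Default to the current working directory, like project=$PWD.
        "project": project_dir if project_dir is not None else os.getcwd(),
    }


def cross_project_tags(entries: list, min_projects: int = 2) -> list:
    """Return tags that appear in at least min_projects distinct projects."""
    projects_by_tag = defaultdict(set)
    for entry in entries:
        for tag in entry["tags"]:
            projects_by_tag[tag].add(entry["project"])
    return sorted(
        tag for tag, projects in projects_by_tag.items()
        if len(projects) >= min_projects
    )
```

Without the `project` field, every entry looks like it came from the same place and the second function has nothing to group by.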
Pitfall 5: Agent decision-making is unreliable
The original design was "cron-triggered Agent reads JSON → decides whether to update MEMORY.md." But the Agent might not read it, or might read and not act.
Fix: Switch to Hook auto-execution—zero Agent decisions required.
Industry comparison
| Capability | Claude Code Auto-Memory | Codex Chronicle | Ours |
|---|---|---|---|
| Agent self-writes context | ✅ CLAUDE.md | ✅ chronicle | ✅ AGENTS.md + MEMORY.md writeback |
| Progressive loading | ✅ description only | ✅ 2% context cap | ⚠️ Hermes architecture constraint |
| File decay | ❌ | ❌ | ✅ mtime + trust decay |
| Pattern → skill generation | ❌ | ❌ | ✅ 3+ same-tag → draft |
| Memory poisoning defense | ❌ | ❌ | ✅ MINJA pattern scanning |
| Zero-damage design | ❌ (coupled to core) | ❌ | ✅ No core code modified |
Our unique advantages: an ACE three-role closed loop (as far as I know, the only complete implementation of the three compared here), Hook-first architecture, and self-evolving skill generation.
Who this is for
If you:
- Maintain 3+ AI Agent projects
- Have 50+ skills but no idea which ones are actually used
- Have an Agent whose memory files are silently ballooning
- Want your Agent to learn new skills from its own pitfall records
This system can be adapted directly. All scripts live under ~/.hermes/skills/ and ~/.hermes/scripts/, with zero core Agent modifications.
Next steps
The current system runs the full Reflector→Curator→Evolver loop. Coming next:
- Embedding-based semantic dedup — currently using content hash, which misses near-duplicates
- Cross-project AGENTS.md auto-update — when patterns appear across 2+ projects
- LoCoMo benchmark evaluation — quantify memory improvement with standard benchmarks
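The first item is easy to demonstrate: a content hash catches exact copies but misses anything that differs by a single character. In the sketch below, `difflib.SequenceMatcher` is a standard-library stand-in for a similarity score. It is not the embedding-based approach planned above, and the 0.9 threshold and function names are arbitrary assumptions.

```python
import hashlib
from difflib import SequenceMatcher


def content_hash(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def exact_duplicates(a: str, b: str) -> bool:
    """Content-hash dedup: true only for identical normalized text."""
    return content_hash(a) == content_hash(b)


def near_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    """Character-level similarity as a rough stand-in for semantic dedup."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


a = "Check that yq is installed before designing the pipeline."
b = "Check that yq is installed before designing the pipeline"
print(exact_duplicates(a, b), near_duplicates(a, b))
```

Character similarity still misses paraphrases that share meaning but not wording, which is the gap an embedding model would be meant to close.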
This system is based on ACE (arXiv 2510.04618) and Memory for Autonomous LLM Agents (arXiv 2603.07670). All code at ~/context-engineering-system/, zero Hermes core modifications.