Giving Hermes Agent an 'External Enhancement System': Deep Transformation of the Memory System and Skill System
A complete practice record of AI Agent memory management + Skill self-evolution + automated maintenance. Without modifying Hermes core code, using external plugins to make the Agent smarter with use.
TL;DR
| Dimension | Before | After |
|---|---|---|
| Cross-session memory | Each new session starts from scratch | 5-layer memory + auto-maintenance |
| Knowledge accumulation | Experience dies with the session | Reflection → solutions/ → auto-promotion |
| Skill management | 60 skills flat, no quality control | Hub layering + quality gates + anti-bloat |
| Token consumption | Large skills fully injected | On-demand loading, 87→56 enabled |
| Automation | Manual maintenance | Session-level/daily/weekly cron full coverage |
Part 1: Memory System
1.1 Problem: The Agent's "Goldfish Memory"
Hermes Agent natively has three layers of memory: MEMORY.md (system prompt injection), memory tool (memories/MEMORY.md), and session_search (SQLite conversation history). It looks sufficient on paper, but in practice there are three fatal issues:
| Problem | Manifestation |
|---|---|
| Cross-session amnesia | When a new session starts, the Agent remembers neither the bugs debugged yesterday nor the decisions already made |
| Knowledge doesn't accumulate | The same pitfalls are hit again; solutions disappear when the session ends |
| Manual maintenance | MEMORY.md fills up and needs manual cleanup; outdated info never expires automatically |
1.2 Our Transformation: 5-Layer Memory Architecture
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: System Prompt MEMORY/USER │
│ Always loaded, 2200 char limit, stores most critical facts │
│ Files: ~/.hermes/MEMORY.md + ~/.hermes/memories/MEMORY.md │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: fact_store (Holographic Structured Memory) │
│ Vector embeddings + entity relationships + trust scores │
│ Supports semantic search, entity probing, compositional reasoning │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: session_search (FTS5 Conversation History) │
│ Full-text search of past conversations, supports │
│ discovery/scroll/read/browse four modes │
├─────────────────────────────────────────────────────────────┤
│ Layer 4: compound-system (Reflection + Knowledge Base) │
│ Auto-reflection after tasks, writes to solutions/, │
│ with track/level three-tier management │
├─────────────────────────────────────────────────────────────┤
│ Layer 5: Skill System (Reusable Skills) │
│ Auto-extracted from success patterns, loaded on demand, │
│ anti-bloat │
└─────────────────────────────────────────────────────────────┘
1.3 compound-system: Teaching the Agent to Reflect
This is the core of the entire memory system. Each time a non-trivial task is completed, the Agent automatically executes a reflection workflow:
Task Complete
↓
compound.sh reflect → Determine if reflection is needed
↓ Needs reflection
reflect.sh → Call LLM analysis, output structured JSON
↓
write-solution.sh → Write to solutions/ directory
↓
Next time a similar issue arises → compound.sh search → Hit historical solution
Key Design Decisions:
| Decision | Choice | Reason |
|---|---|---|
| Reflection trigger | Auto-determine | Not every task deserves reflection, avoid noise |
| Storage format | Markdown files | Human-readable, git-friendly, easy to search |
| Knowledge grading | working → session → longterm | Three-tier promotion, avoid info explosion |
| Decay mechanism | Auto-archive after 30 days | Unused knowledge cools down automatically |
| Search method | FTS5 full-text search | Runs locally, no external dependencies |
Actual Results:
$ bash compound.sh search "ruff format"
[INFO] Found 3 solution(s) for: ruff format
[1] "ruff format debt causing CI failure"
File: solutions/bug/ticketpilot-ruff-format-debt-2026-06-21.md
Track: bug, Level: session
[2] "Must check ruff after sub-agent module split"
File: solutions/session/ruff-subagent-lesson.md
Track: session, Level: session
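A search like the one above can be approximated from Python with SQLite's FTS5 module. This is a minimal sketch under stated assumptions, not the actual compound.sh implementation: the table layout, the in-memory database, and the `build_index` and `search` names are hypothetical, and it needs a Python build whose SQLite includes FTS5.

```python
import sqlite3


def build_index(solutions):
    """Create an in-memory FTS5 index from (title, path, body) tuples."""
    conn = sqlite3.connect(":memory:")
    # path is stored but not tokenized, so only title and body are searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE solutions USING fts5(title, path UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO solutions VALUES (?, ?, ?)", solutions)
    return conn


def search(conn, query, limit=5):
    """Return (title, path) rows that contain every query term, best match first."""
    terms = query.split()
    if not terms:
        return []
    # Quote each term so punctuation is not parsed as FTS5 query syntax.
    match = " ".join('"' + term.replace('"', '""') + '"' for term in terms)
    rows = conn.execute(
        "SELECT title, path FROM solutions WHERE solutions MATCH ? "
        "ORDER BY rank LIMIT ?",
        (match, limit),
    )
    return rows.fetchall()
```

Keeping the index in a local SQLite file (instead of `:memory:`) would match the "runs locally, no external dependencies" goal from the design table.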
1.4 reflect.sh: LLM-Driven Structured Reflection
Initially reflect.sh only returned raw JSON, and the LLM often just echoed the input without doing any analysis. It took several iterations to get a usable pipeline:
| Iteration | Problem | Fix |
|---|---|---|
| v1 | "Return JSON only" prompt too simple | Switch to structured prompt, explicitly require root_cause/solution/lessons/patterns |
| v2 | Nested quoting broke the curl call in bash | Switch to Python urllib |
| v3 | log_info polluting stdout | Redirect to stderr |
| v4 | write-solution.sh dropping fields | Extend template to support lessons/patterns |
| v5 | Confusion between compound.sh reflect vs reflect.sh | Document the distinction: former decides if reflection is needed, latter actually calls LLM |
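The v1 and v4 fixes both come down to insisting on a fixed set of fields. A minimal sketch of that check is shown here; the four field names come from the table above, while the `parse_reflection` name and the rule that empty fields are rejected are assumptions for illustration.

```python
import json

# Field names taken from the structured prompt described above.
REQUIRED_FIELDS = ("root_cause", "solution", "lessons", "patterns")


def parse_reflection(raw):
    """Parse an LLM reply and reject it if a required field is missing or empty."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if not data.get(field)]
    if missing:
        raise ValueError("missing or empty fields: " + ", ".join(missing))
    # Drop anything the model added beyond the required fields.
    return {field: data[field] for field in REQUIRED_FIELDS}
```

Rejecting incomplete replies before they reach write-solution.sh keeps echoed or half-filled reflections out of solutions/.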
Final pipeline:
# One-command reflection + archiving
bash ~/.hermes/skills/compound-system/scripts/reflect.sh \
"Fixed QQ Bot heartbeat loop" "success" "low" "" \
| bash ~/.hermes/skills/compound-system/scripts/write-solution.sh
1.5 Memory Maintenance Automation
Manual maintenance is not sustainable. We built an automated maintenance chain:
| Trigger | Script | Action |
|---|---|---|
| Session end | session-end.sh | Compress MEMORY.md + archive old solutions + reflect |
| Daily at 2 AM | daily-maintenance.sh | Decay check + dedup check + sync check |
| Weekly at 3 AM | weekly-maintenance.sh | Elimination check + merge check + size check |
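As a rough illustration of the decay and dedup checks these jobs run, the sketch below applies a 30-day half-life to trust scores and drops near-duplicate entries with `difflib.SequenceMatcher`. The half-life and the 0.85 similarity threshold are the article's values; the archive threshold of 0.2, the function names, and the `(timestamp, text)` entry shape are assumptions. It compares all pairs directly and leaves out any FTS5 pre-filtering.

```python
from difflib import SequenceMatcher

HALF_LIFE_DAYS = 30          # half-life used by the decay check
SIMILARITY_THRESHOLD = 0.85  # above this, two entries count as duplicates
ARCHIVE_THRESHOLD = 0.2      # assumed value; the real threshold is not stated


def decayed_trust(old_trust, days_elapsed):
    """Exponential decay: the score halves every HALF_LIFE_DAYS."""
    return old_trust * 0.5 ** (days_elapsed / HALF_LIFE_DAYS)


def should_archive(old_trust, days_elapsed):
    """True when the decayed score falls below the archive threshold."""
    return decayed_trust(old_trust, days_elapsed) < ARCHIVE_THRESHOLD


def dedup(entries):
    """Keep the newer entry of every near-duplicate pair.

    entries is a list of (timestamp, text); the result is newest first.
    """
    kept = []
    for timestamp, text in sorted(entries, reverse=True):  # newest first
        is_duplicate = any(
            SequenceMatcher(None, text, kept_text).ratio() > SIMILARITY_THRESHOLD
            for _, kept_text in kept
        )
        if not is_duplicate:
            kept.append((timestamp, text))
    return kept
```

Processing entries newest first means an older near-duplicate is always the one discarded, which matches the "retain the newer version" rule.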
Decay Mechanism Principle:
# memory-decay.py core logic
# Trust score decays exponentially, 30-day half-life
new_trust = old_trust * (0.5 ** (days_elapsed / 30))
# Below threshold → auto-archive to .archive/
Dedup Mechanism:
# memory-dedup.py core logic
# FTS5 full-text search + SequenceMatcher similarity
# Similarity > 0.85 → merge, retain the newer version
1.6 The MEMORY.md Three-File Trap
This was the trickiest discovery during debugging — there are three different MEMORY.md files in the system:
| File | Purpose | Format | Character Limit |
|---|---|---|---|
| ~/.hermes/MEMORY.md | System prompt injection | Structured markdown | No hard limit |
| ~/.hermes/memories/MEMORY.md | memory tool operations | § separator format | 2,200 |
| ~/.hermes/memory/MEMORY.md | Legacy, deprecated | — | — |
The memory tool's add/replace/remove operates on memories/MEMORY.md, not the one injected into the system prompt. When the tool reports "at X/2200 chars," you need to manually edit memories/MEMORY.md to clean up.
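Before editing by hand, it helps to see how close the file is to the limit and which entries take the most room. The sketch below is an assumption-heavy illustration: it treats every character in the file as counting toward the 2,200 limit and assumes entries are separated by a bare `§`, neither of which is confirmed here; the function names are hypothetical.

```python
LIMIT = 2200  # character limit reported by the memory tool


def memory_usage(text, limit=LIMIT):
    """Return (chars_used, chars_remaining) for the file content."""
    used = len(text)
    return used, limit - used


def largest_entries(text, separator="§", top=3):
    """List the longest entries, the usual first candidates for trimming."""
    entries = [entry.strip() for entry in text.split(separator) if entry.strip()]
    return sorted(entries, key=len, reverse=True)[:top]
```

In practice `text` would be read from ~/.hermes/memories/MEMORY.md, the file the memory tool actually operates on.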
1.7 Production Data
| Metric | Before | After |
|---|---|---|
| Cross-session knowledge retention | 0% | ~90% (compound-system) |
| Repeated pitfall rate | High | Low (search hits historical solutions) |
| MEMORY.md maintenance | Manual | Automatic (cron + decay) |
| New session cold start time | 5-10 minutes | 1-2 minutes |
Part 2: Skill System
2.1 Problem: Skill Bloat
Hermes Agent's skill mechanism essentially injects SKILL.md content into the system prompt. A 300-line skill gets fully injected into every conversation, whether you use it or not.
| Problem | Impact |
|---|---|
| Large number of skills | 60+ skills, each with descriptions and instructions |
| Lark series dominates | 25 Lark skills, never used in QQ chat |
| No quality control | Some skills exceed 500 lines, cramming in every detail |
| No elimination mechanism | Outdated skills permanently consume tokens |
2.2 Transformation 1: Hub + Focused Skill Layering
Core principle: One large skill is worse than multiple small skills.
| Dimension | One Large Skill | Multiple Small Skills |
|---|---|---|
| Token consumption | High (always loads everything) | Low (on-demand loading) |
| Attention | "Lost in the Middle" effect | Each one is concise |
| Maintenance | Changing one thing affects everything | Independent updates |
| Reusability | Hard to reuse | Composable |
The transformed architecture:
Hub Skill (Index layer, <100 lines)
context-engineering-hub
Quick reference + on-demand loading of sub-skills
↓ On-demand loading
Focused Skills (Function layer, each <200 lines)
project-context File structure, templates
skill-evolver Generate skills from success patterns
context-validation Verify improvement effectiveness
self-evolution-system Complete closed-loop architecture
↓
Tool Skills (Support layer)
memory-orchestrator Unified memory orchestration
resilient-web-search Search API fallback
Real Case: context-engineering Split
| Before Split | After Split |
|---|---|
| 1 skill, 683 lines | 4 skills: hub(95 lines) + project-context(148 lines) + token-compression(120 lines) + session-handoff(89 lines) |
| Always fully loaded | Load corresponding sub-skill on demand |
| Information overload | Each is concise and focused |
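The on-demand step can be pictured as a small lookup from the task at hand to the sub-skills worth loading. In the sketch below the sub-skill names come from the split above, but the keyword lists and the `select_sub_skills` helper are hypothetical; in a real setup the Agent reads the hub's quick reference and decides for itself, so plain keyword matching is only a stand-in.

```python
# Hypothetical hub index: sub-skill name -> keywords that suggest loading it.
HUB_INDEX = {
    "project-context": ("file structure", "template"),
    "token-compression": ("compress", "token"),
    "session-handoff": ("handoff", "resume"),
}


def select_sub_skills(task, index=HUB_INDEX):
    """Return the sub-skills whose keywords appear in the task description."""
    task_lower = task.lower()
    return [
        name
        for name, keywords in index.items()
        if any(keyword in task_lower for keyword in keywords)
    ]
```

A task that matches nothing loads only the hub, which is the whole point of keeping the hub under 100 lines.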
2.3 Transformation 2: Quality Gates
Every skill creation/update must pass quality gates:
| Check | Threshold | Action on Failure |
|---|---|---|
| Line count | ≤200 lines | Compress or split |
| Three-part trap | At least 1 | Add one |
| Edge-case three-part (always/ask/never) | Required | Add one |
| Code example | At least 1 | Add one |
| Pointer reference | Don't embed full documentation | Change to pointer |
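Most of these gates can be checked mechanically. The following is a minimal sketch, assuming a trap is written as a bullet of the form `- **name**: cause → fix`, that the edge-case section uses the words always, ask, and never, and that a code example means a fenced block; `check_skill` is a hypothetical helper, and the pointer-reference gate is left to human review.

```python
import re

MAX_LINES = 200
FENCE = "`" * 3  # code-fence marker, built this way to avoid a literal fence here
# Assumed trap format: "- **name**: cause → fix"
TRAP_PATTERN = re.compile(r"^- \*\*.+?\*\*: .+→.+$", re.MULTILINE)
EDGE_CASE_WORDS = ("always", "ask", "never")


def check_skill(text):
    """Return the names of the gates that fail; an empty list means the skill passes."""
    failures = []
    if len(text.splitlines()) > MAX_LINES:
        failures.append("line count")
    if not TRAP_PATTERN.search(text):
        failures.append("three-part trap")
    has_all_words = all(
        re.search(rf"\b{word}\b", text, re.IGNORECASE) for word in EDGE_CASE_WORDS
    )
    if not has_all_words:
        failures.append("edge-case three-part")
    # One fenced block needs an opening and a closing fence.
    if text.count(FENCE) < 2:
        failures.append("code example")
    return failures
```

Running a check like this on every skill creation or update would turn the table into an enforceable gate instead of a guideline.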
Three-Part Trap Example:
- **psycopg_pool transaction rollback**: `with` block auto-rollbacks on exit → use `autocommit=True` for write operations
- **ruff format debt**: CI runs `ruff format --check` → run `ruff format .` before committing
2.4 Transformation 3: Anti-Bloat Mechanisms
| Mechanism | Rule | Action |
|---|---|---|
| Decay | Not used for 30 days | Archive to .archive/ |
| Merge | 2+ similar skills | Merge into one |
| Cap | Exceeds 200 lines | Compress |
| Eliminate | Not referenced for 6 months | Delete |
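The rules in this table can be read as one decision per skill. The sketch below is only an illustration: treating 6 months as 180 days, the precedence (delete before archive before compress), and the `anti_bloat_action` name are all assumptions, and the merge rule is left out because it needs a similarity comparison across skills.

```python
ARCHIVE_AFTER_DAYS = 30   # not used for 30 days
DELETE_AFTER_DAYS = 180   # "6 months", approximated as 180 days (assumption)
MAX_LINES = 200


def anti_bloat_action(days_since_use, days_since_reference, line_count):
    """Pick one maintenance action for a skill; the rule order is an assumption."""
    if days_since_reference > DELETE_AFTER_DAYS:
        return "delete"
    if days_since_use > ARCHIVE_AFTER_DAYS:
        return "archive"
    if line_count > MAX_LINES:
        return "compress"
    return "keep"
```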
2.5 Transformation 4: Smart Skill Injection
Hermes natively loads descriptions of all enabled skills every conversation. Our optimization:
# config.yaml
skills:
disabled:
# Lark series (25 skills) — not needed for QQ chat
- lark-approval
- lark-apps
- lark-attendance
# ... 25 total
# Other unnecessary skills
- yuanbao
- honcho
- hermes-memory-setup
# ... 6 total
| Metric | Before Optimization | After Optimization |
|---|---|---|
| Enabled skills | 87 | 56 (-31) |
| System prompt injection | ~4600 lines skill descriptions | ~2800 lines |
| Estimated tokens/request | — | -2000~3000 |
2.6 resilient-web-search: Search Fault Tolerance
Tavily API frequently returns 432 (quota exhausted). We built a fallback chain:
web_search (Tavily)
↓ On failure
web_search_plus (auto routing)
↓ On failure
web_search_plus (explicit provider polling)
↓ On failure
Return error + list of attempted providers
This skill is referenced by all search-related cron jobs, ensuring that even if one API goes down, we don't come back empty-handed.
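The fallback chain has a simple shape: try each provider in order, remember which ones were attempted, and raise only when all of them fail. This is a minimal sketch of that shape, not the skill itself; `resilient_search`, `SearchError`, and the idea of passing providers as `(name, callable)` pairs are assumptions for illustration.

```python
class SearchError(Exception):
    """Raised when every provider fails; carries the list of attempted providers."""

    def __init__(self, attempted):
        super().__init__("all providers failed: " + ", ".join(attempted))
        self.attempted = attempted


def resilient_search(query, providers):
    """Try each (name, search_fn) pair in order and return the first success.

    Returns (provider_name, result). Any exception from a provider, such as
    a quota error, moves on to the next one.
    """
    attempted = []
    for name, search_fn in providers:
        attempted.append(name)
        try:
            return name, search_fn(query)
        except Exception:
            # Broad on purpose: quota, network, and parsing errors all mean
            # "try the next provider" in this sketch.
            continue
    raise SearchError(attempted)
```

Returning the provider name alongside the result makes it easy for a cron job to log which link in the chain actually answered.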
2.7 Skill Optimization Production Data
| Metric | Before Optimization | After Optimization |
|---|---|---|
| Total skills | 60 (4629 lines) | 56 (~2200 lines effective injection) |
| Max single skill | 648 lines (lark-mail) | ≤200 lines |
| On-demand loading | None | hub + focused |
| Anti-bloat | None | Decay + merge + elimination |
Part 3: Automation System
3.1 Cron Job Schedule (UTC+8)
| Job | Schedule | Description |
|---|---|---|
| GitHub Hot Daily | Daily 09:00 | Search trending, organize in Chinese |
| compound-system-refresh | Daily 03:00 | Refresh solutions index |
| Unified Memory Maintenance | Daily 04:00 | MEMORY.md + fact_store check |
| AgentMemory Maintenance | Daily 19:00 | MCP server health check |
| Hermes Config Backup | Daily 06:00 | config.yaml backup |
| skill-index-rebuild | Sunday 09:00 | Rebuild skill index |
3.2 Session-Level Automation
Session Start
↓
fact_store(action='search') → Look up existing knowledge
↓
Start Task
↓
Task Complete
↓
compound.sh reflect → Reflection
↓
session-end.sh → Compress + archive + update stats
3.3 Decoupling Principle
None of the transformations modify Hermes core code. This is the most important design decision:
| Component | Location | Relationship with Hermes |
|---|---|---|
| compound-system | ~/.hermes/skills/compound-system/ | Independent skill, no core modification |
| memory-orchestrator | ~/.hermes/skills/memory-orchestrator/ | Independent skill |
| context-engineering-hub | ~/.hermes/skills/context-engineering-hub/ | Independent skill |
| resilient-web-search | ~/.hermes/skills/resilient-web-search/ | Independent skill |
| Maintenance scripts | ~/.hermes/scripts/ | Independent scripts |
| Cron jobs | Hermes cron system | Using native scheduling capabilities |
Why decouple?
- No conflicts when Hermes updates
- Can be tested and iterated independently
- Can be shared with other users
- Low maintenance cost — no need to track upstream changes
Part 4: Design Philosophy
4.1 From Research to Practice
All our decisions are backed by research:
| Research | Finding | Our Application |
|---|---|---|
| ETH Zurich "Evaluating AGENTS.md" | Auto-generation reduces success rate by 3% | Manually write skills, don't auto-generate |
| GitHub 2500+ repo analysis | Three-part edge cases most effective | Skills must have always/ask/never |
| "Lost in the Middle" | Middle info in long contexts gets ignored | Small skills, on-demand loading |
| Longbench | Moderate compression improves output quality | Structured format over paragraphs |
| Mem0 2026 Agent Memory | Multi-layer memory architecture is best | 5-layer memory system |
4.2 Core Insights
- Reducing tokens ≠ reducing understanding: Structured compression can achieve both simultaneously
- Progressive disclosure > full loading: Only load detailed content when needed
- Closed-loop validation is key: Each improvement must be validated; don't optimize blindly
- Automation is a necessity: Manual maintenance is not sustainable
- Decoupling is a principle: Don't depend on the internal implementation of any specific tool
4.3 Directions Still in Iteration
| Direction | Status | Priority |
|---|---|---|
| SkillOpt integration (auto-training skills) | POC complete, 0.825→0.85 target | High |
| ONNX semantic embeddings (fact_store vector search) | Planned | Medium |
| Cross-project skill reuse | Proof of concept | Medium |
| LLM-driven reflect.sh integration | Implemented | ✅ |
Summary
The core idea behind adding an external enhancement system to Hermes Agent: Don't modify the core; use plugin mechanisms to extend capabilities.
- Memory system prevents the Agent from forgetting: pitfalls already hit are remembered and past decisions are retained
- Skill system organizes knowledge: on-demand loading, quality control, automatic elimination
- Automation keeps the system running: session-level, daily, and weekly maintenance, with no manual intervention needed
All code is under ~/.hermes/skills/, purely external, never touching the Hermes core.
Written on 2026-06-21, based on practice with Hermes Agent + compound-system + memory-orchestrator