Agent Self-Evolution: From Half a Year of Practice to a Trajectory Mining Pipeline
A hands-on account of building a self-evolving agent system on Hermes Agent: from surveying 8 papers to shipping a 4-agent pipeline, from the symbol-vs-parameter debate to the Executor-Curator separation architecture, from 196 sessions of chat history to automated intelligence extraction.
1. The Problem
After running an agent system daily for half a year, one problem stands out above all:
Chat history keeps growing, but its value decays.
Our state.db holds 196 sessions with 29k messages. A single complex session might use hundreds of messages and dozens of tool calls — but unless we revisit it, that experience sleeps forever in the database. Our weekly session_prune was, in effect, destroying an unmined knowledge deposit.
The deeper issue: agents repeat the same mistakes. A git error today, the same git error next week, debugging from scratch each time. What's missing isn't remembering the answer (that's RAG's job) — it's distilling reusable behavioral patterns from experience. That is, skills.
This isn't unique to us. Between March and May 2026, a wave of papers suddenly appeared, all probing the same question:
How can an agent learn from its own operational traces and evolve?
Trace2Skill, CoEvoSkills, SkillX, SkillClaw, SkillOpt, SkillOS, SKILL0, Skill1: eight papers from different institutions, nearly simultaneously.
2. Two Routes: Symbol vs. Parameter
The eight papers fall cleanly into two camps.
Route A: Symbolic/Text Evolution
Skill = an external text file (SKILL.md). Readable, writable, auditable. LLMs read, write, merge, and edit to evolve.
| Work | Skill Format | Evolution Method |
|---|---|---|
| Trace2Skill | Single skill.md | Offline batch → parallel patches → hierarchical merge |
| CoEvoSkills | Multi-file package | Online Generator ↔ Verifier adversarial co-evolution |
| SkillX | 3-tier KB | Offline construction + active exploration |
| SkillClaw | Text skill | Cross-user trajectory aggregation → continuous update |
| SkillOpt | skill.md | DL-style training (learning rate / batch / gate) |
| SkillOS | Single Markdown | RL-trained Curator manages lifecycle online |
Route B: Parameter Internalization
Skill = parameter knowledge within model weights. RL training internalizes skills toward zero-shot execution.
| Work | Internalization | Optimization Target |
|---|---|---|
| SKILL0 | Dynamic Curriculum: full context → gradual withdrawal | Zero-shot without skill injection |
| Skill1 | Signal decomposition: low-pass → selection, high-pass → distillation | Single policy does three things |
The Core Trade-off
| Dimension | Symbolic Route | Parameter Route |
|---|---|---|
| Auditability | ✅ Fully readable | ❌ Weights are black boxes |
| Cross-model migration | ✅ Copy & paste | ❌ Requires retraining |
| Inference token cost | ❌ 3-5K per injection | ✅ Zero-shot <0.5K |
| Maintenance complexity | Low (git) | High (RL pipeline) |
| Best for | Multi-model, multi-platform, auditable | Fixed model, high-frequency tasks |
Our choice: symbolic route short-term, hybrid long-term. Hermes switches between models (deepseek, gemini, claude…) — cross-model portability is non-negotiable. But SkillOS's RL-trained Curator inspired us: eventually, manage the symbolic skill library with a trained policy, and selectively internalize high-frequency skills.
3. Architectural Watershed: Executor-Curator Separation
SkillOS's Executor-Curator separation is the architectural watershed.
Core idea: The agent (Executor) and the skill manager (Curator) should be completely independent systems. The agent just does the work. The Curator handles skill creation, merging, pruning, and evaluation. The Curator has its own objective function (downstream task success rate), its own training loop, fully decoupled from the agent's language model.
SkillOS's striking result: an 8B parameter specialized Curator outperforms Gemini 2.5 Pro with zero-shot prompting. Small model + training > large model + generic prompt.
Looking back, our Hermes system had unconsciously implemented this:
- Executor: Hermes Agent (runs tasks, calls tools, writes code)
- Curator: Deterministic pipeline (compound-system reflection, duplicate skill merging, decay-check)
- Peripheral systems: SkillOpt training, pattern-detector, validation-set
We just hadn't recognized the pattern. Our Curator was deterministic rules, not an RL policy. Naming the pattern unified our design framework going forward.
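To make the separation concrete, here is a minimal sketch of the pattern as we understand it, not SkillOS's implementation and not Hermes code. All class names, the `run_task` callable, and the pruning thresholds are hypothetical. The point it illustrates is the boundary: the Executor only reads skills and reports outcomes, while the Curator (here a deterministic rule set, like ours) is the only component that changes the library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    body: str          # SKILL.md text injected into the Executor's context
    uses: int = 0
    successes: int = 0


@dataclass
class TaskOutcome:
    task: str
    skills_used: list[str]
    success: bool


class SkillLibrary:
    """The only state the Executor and the Curator share."""

    def __init__(self) -> None:
        self.skills: dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        self.skills[skill.name] = skill


class Executor:
    """Does the work. Reads skills, never edits them."""

    def __init__(self, library: SkillLibrary,
                 run_task: Callable[[str, list[Skill]], bool]) -> None:
        self.library = library
        self.run_task = run_task  # hypothetical stand-in for the real agent

    def execute(self, task: str) -> TaskOutcome:
        skills = list(self.library.skills.values())
        success = self.run_task(task, skills)
        return TaskOutcome(task, [s.name for s in skills], success)


class Curator:
    """Owns the skill lifecycle. Deterministic rules here, not an RL policy."""

    def __init__(self, library: SkillLibrary,
                 min_uses: int = 3, min_success_rate: float = 0.5) -> None:
        self.library = library
        self.min_uses = min_uses                  # assumed threshold
        self.min_success_rate = min_success_rate  # assumed threshold

    def observe(self, outcome: TaskOutcome) -> None:
        # Credit or blame every skill that was in context for this task.
        for name in outcome.skills_used:
            skill = self.library.skills.get(name)
            if skill is None:
                continue
            skill.uses += 1
            skill.successes += int(outcome.success)

    def prune(self) -> list[str]:
        # Remove skills with enough evidence and a poor success rate.
        removed = [
            name for name, s in self.library.skills.items()
            if s.uses >= self.min_uses
            and s.successes / s.uses < self.min_success_rate
        ]
        for name in removed:
            del self.library.skills[name]
        return removed
```

Because the Curator only consumes `TaskOutcome` records, swapping the rule set for a trained policy later would not require touching the Executor.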
4. Trajectory Mining Pipeline: From Chat Logs to Intelligence
With the Executor-Curator framework in place, how do we actually mine the traces? We designed a 4-agent pipeline drawing on Trace2Skill, Skill-DisCo, AutoRefine, and Vadim's Trajectory Miner.
Architecture
Trace Ingestor → Pattern Miner → Knowledge Distiller → Validation Curator
│ │ │ │
state.db compound-system fact_store validation-set
/bugs /knowledge +MEMORY.md
Four roles, fully separated:
- Trace Ingestor: Reads state.db, produces structured session summaries. No analysis, no judgment — just data cleaning.
- Pattern Miner: Cross-session pattern discovery — recurring errors (≥2 sessions), success patterns (≥2 same task type), source insights (error rate > 30%). Every pattern must pass self-questioning: root cause or symptom?
- Knowledge Distiller: Simple signals (e.g., "qqbot source has 67% error rate") → fact_store; complex patterns → compound-system/knowledge. Prioritized by AutoRefine's effectiveness × log(frequency) × precision.
- Validation Curator: Extracts validation tasks from successful sessions, deduplicates, maintains a 30-task cap.
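The Pattern Miner's three thresholds can be sketched roughly as follows. This is an illustration under assumptions, not the code in trajectory-miner.py: the `SessionSummary` fields are hypothetical, and "error rate" is assumed to mean the share of a source's sessions that contain at least one error signature.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionSummary:
    # Hypothetical shape of what the Trace Ingestor hands over.
    session_id: str
    source: str
    task_type: str
    success: bool
    error_signatures: tuple[str, ...] = ()


def recurring_errors(summaries, min_sessions=2):
    """Error signatures seen in at least `min_sessions` distinct sessions."""
    seen = defaultdict(set)
    for s in summaries:
        for sig in s.error_signatures:
            seen[sig].add(s.session_id)
    return {sig: sorted(ids) for sig, ids in seen.items()
            if len(ids) >= min_sessions}


def success_patterns(summaries, min_sessions=2):
    """Task types with at least `min_sessions` successful sessions."""
    by_task = defaultdict(set)
    for s in summaries:
        if s.success:
            by_task[s.task_type].add(s.session_id)
    return {task: sorted(ids) for task, ids in by_task.items()
            if len(ids) >= min_sessions}


def noisy_sources(summaries, max_error_rate=0.30):
    """Sources whose share of sessions with errors exceeds the threshold."""
    totals = defaultdict(int)
    with_errors = defaultdict(int)
    for s in summaries:
        totals[s.source] += 1
        if s.error_signatures:
            with_errors[s.source] += 1
    rates = {src: with_errors[src] / totals[src] for src in totals}
    return {src: rate for src, rate in rates.items() if rate > max_error_rate}
```

Counting distinct session IDs rather than raw occurrences matters: an error that repeats ten times inside one session is still a single incident, not a pattern. The "root cause or symptom?" self-questioning step is a judgment call and is not modeled here.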
7 Design Decisions
Each decision traces back to published work, not guesswork:
| # | Decision | Source | Effect |
|---|---|---|---|
| 1 | Parallel batch consolidation > sequential editing | Trace2Skill | 3min vs 60min, higher quality |
| 2 | Pure analyst role: produce intelligence, not code | Vadim | Auditable, reversible |
| 3 | ≥2 threshold + self-questioning | Vadim + AutoRefine | 13/62 false positives filtered round 1 |
| 4 | Asymmetric failure analysis: deep vs shallow | Trace2Skill | Richer metadata on failed sessions |
| 5 | Compile + verify closed loop | Skill-DisCo | 0% execution error (paper claim) |
| 6 | Maintenance-score-driven pruning | AutoRefine | Low-score patterns auto-degrade |
| 7 | Dual-form: simple→fact_store, complex→compound-system | AutoRefine | Signals allocated by density |
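Decisions 6 and 7 combine naturally with the effectiveness × log(frequency) × precision priority mentioned earlier. The sketch below is one way to wire them together, under stated assumptions: the natural logarithm, the `min_score` cutoff, the `Pattern` fields, and the "no remediation steps means simple signal" heuristic are all ours for illustration, not taken from AutoRefine.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    title: str
    effectiveness: float           # 0..1, how much applying it helps
    frequency: int                 # number of sessions it appeared in
    precision: float               # 0..1, share of true positives
    steps: tuple[str, ...] = ()    # remediation steps, empty for plain facts


def priority(p: Pattern) -> float:
    """effectiveness x log(frequency) x precision (natural log assumed)."""
    if p.frequency < 1:
        raise ValueError("frequency must be at least 1")
    return p.effectiveness * math.log(p.frequency) * p.precision


def route(p: Pattern) -> str:
    """Dual-form routing: plain facts vs. multi-step knowledge entries."""
    # Assumed heuristic: a pattern with no steps is a simple signal.
    return "fact_store" if not p.steps else "compound-system/knowledge"


def triage(patterns, min_score=0.5):
    """Rank patterns by priority; demote anything under `min_score`."""
    routed, demoted = [], []
    for p in sorted(patterns, key=priority, reverse=True):
        if priority(p) >= min_score:
            routed.append((p.title, route(p)))
        else:
            demoted.append(p.title)
    return routed, demoted
```

One property worth noting: because log(1) = 0, a pattern seen in a single session scores zero regardless of its other factors, which lines up with the ≥2 threshold.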
P1: Foundation (Harvest + Mine)
First version: ship the pipeline, worry about quality later.
trajectory-miner.py harvest → mine → write-solutions
Round 1 results:
| Metric | Value |
|---|---|
| Sessions analyzed | 189 |
| Real patterns found | 49 |
| False positives filtered | 13 |
| Written to compound-system bugs | 43 |
| Written to compound-system knowledge | 6 |
One interesting finding: every source had an error rate between 52% and 66%. That is too uniform to be coincidence; more likely our detection threshold was too aggressive. This directly informed P2's false-positive improvements.
P2: Knowledge Distillation + Validation Curation
Two downstream consumers:
- Knowledge Distiller: Produces fact-staging.json (5 candidates with scores), memory-staging.md (5 MEMORY.md suggestions), 3 compound-system/knowledge aggregate entries (error ranking, success aggregation, task distribution).
- Validation Curator: Extracts tasks from successful sessions, deduplicates (15/27 already existed), rotates to maintain 30-task cap.
Full pipeline: < 30 seconds.
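The Validation Curator's dedupe-and-rotate step might look roughly like this. It is a sketch under assumptions: the `ValidationTask` fields, the normalization used for deduplication, and the "drop the oldest first" rotation policy are hypothetical choices, since the text above only specifies deduplication and a 30-task cap.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationTask:
    prompt: str
    created_at: str  # ISO date, e.g. "2026-05-12"


def task_key(prompt: str) -> str:
    """Dedup key: case-folded, whitespace-collapsed prompt, hashed."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def curate(existing, candidates, cap=30):
    """Merge new tasks, skip duplicates, keep only the newest `cap` tasks.

    Returns (kept_tasks, number_of_candidates_added).
    """
    seen = {task_key(t.prompt) for t in existing}
    merged = list(existing)
    added = 0
    for task in candidates:
        key = task_key(task.prompt)
        if key in seen:
            continue
        seen.add(key)
        merged.append(task)
        added += 1
    # Assumed rotation policy: oldest tasks fall off first.
    merged.sort(key=lambda t: t.created_at)
    return merged[max(0, len(merged) - cap):], added
```

Exact-match hashing only catches near-identical prompts. Semantically equivalent tasks phrased differently would need a fuzzier comparison, which is outside this sketch.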
5. Parallel Engineering Work
The pipeline was the main thread, but not the only one.
1. Memory System Refactor
19 fragmented scripts → 3 unified scripts (memory-core.py for maintenance, memory-sync.py for sync, memory-health.py for monitoring). 5 cron jobs instead of 11. Memory usage: 2450B → 1374B.
2. state.db Optimization
Two-layer cleanup: FTS5 rebuild to eliminate content duplication (351 MB → 200 MB), then delete tool messages from sessions >7 days old (200 MB → 134 MB). -62% total.
Principle: raw chat logs are a temporary cache; permanent records live in fact_store + compound-system + MEMORY.md.
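The second cleanup layer (dropping tool messages from sessions older than 7 days) can be sketched with the standard `sqlite3` module. The schema here is an assumption for illustration: a `sessions(id, started_at)` table with ISO-8601 UTC timestamps and a `messages(id, session_id, role, content)` table. The real state.db layout may differ, and the FTS5 rebuild layer is not shown.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def prune_tool_messages(conn, now=None, max_age_days=7):
    """Delete role='tool' messages from sessions older than the cutoff.

    Assumes hypothetical tables sessions(id, started_at) and
    messages(id, session_id, role, content), with started_at stored
    as an ISO-8601 UTC string so string comparison orders by time.
    Returns the number of deleted rows.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.execute(
        """
        DELETE FROM messages
        WHERE role = 'tool'
          AND session_id IN (
              SELECT id FROM sessions WHERE started_at < ?
          )
        """,
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Note that deleting rows does not shrink a SQLite file by itself; the freed pages are only returned to the filesystem after a `VACUUM` (or with auto-vacuum enabled), so a size reduction like the one above implies that step ran as well.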
3. SkillOpt Integration
SkillOpt's optimized skills auto-ingest into compound-system and the skill library — a training → validation → deployment closed loop.
4. Validation Set Construction
Auto-built from session history: started at 15 tasks, expanded to 30 via Validation Curator, covering success and failure scenarios.
5. Pi Trace Ingestion
Pi coding agent session data is periodically ingested via a dedicated script and unified with Hermes traces.
6. Execution Schedule
The final automated chain:
Sun 01:00 Trajectory Pipeline (all 4 phases)
Sun 01:45 fact_store application (LLM-driven cron)
Sun 02:00 session_prune (safe deletion)
Daily:
- 03:00 SkillOpt auto-evolution + compound-system refresh
- 04:00 memory-core maintenance
- 05:00 Pi trace ingestion
- 06:00 Curator validation + Hermes config backup
- 07:00 system-health + skill-rejuvenate
- 08:00 memory-health + skillopt-ingest
- 09:00 validation-set-refresh + skill-index-rebuild
22 weekly cron jobs, forming a self-sustaining evolution system. Problems get found, knowledge gets distilled, skills get updated — without human intervention.
7. Lessons Learned
What Worked
1. Research first, code second.
Reading 8 papers + 4 blog posts before writing a line of code saved us from blind implementation. Without Trace2Skill, we'd have built a sequential pipeline and rediscovered its 20x slowdown.
2. Pure analyst role.
The pipeline doesn't write skills, edit code, or make decisions — it only produces structured intelligence files. This felt constraining at first, but it's the reason the system stays clean: single responsibility, defined output formats, downstream consumers free to choose.
3. ≥2 threshold + self-questioning.
Single occurrence = incident. Two occurrences = pattern. Every pattern must answer "root cause or symptom?" This killed a lot of noise (13/62 signals filtered in round 1).
4. Ship the end-to-end pipeline first, improve quality later.
P1 only did harvest + mine. Ugly filenames, imperfect entries. But it ran end-to-end. P2 added distillation and curation. Progressive delivery.
5. Pipeline runs before prune.
Extract value before deleting data. This timing guarantees state.db is always consumed before cleaned.
What Could Be Better
1. P1 title quality.
Bug filenames are raw JSON keys: 2026-07-05-auto-success___true___diff____---_a__path__n____b.md. Content has proper frontmatter, but it hurts readability.
2. False positive threshold.
Currently flags everything with error/fail/❌ as an error, producing 52-66% error rates across all sources. Many are harmless tool permission warnings and empty outputs. Needs a finer-grained signal classifier.
3. Validation set lacks feedback loop.
Currently one-directional: add tasks. No "this validation passed/failed → adjust pipeline threshold" feedback.
4. fact-staging.json requires LLM to consume.
fact_store is an agent tool; a pure script can't write to it. The LLM cron bridge adds architectural complexity.
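For the false-positive problem in item 2 above, a finer-grained classifier could start as simply as separating "contains an error marker" from "contains an error marker that is not on a known-benign line". The sketch below is one possible starting point, not what the pipeline currently does. The benign patterns are hypothetical examples modeled on the cases mentioned (permission warnings, empty outputs, zero-failure summaries) and would need tuning against real tool output.

```python
import re
from enum import Enum


class Signal(Enum):
    ERROR = "error"
    BENIGN = "benign"
    NONE = "none"


# The current, coarse trigger: any of these marks a message as an error.
ERROR_MARKERS = re.compile(r"error|fail|❌", re.IGNORECASE)

# Hypothetical allow-list of lines that match a marker but are harmless.
BENIGN_PATTERNS = [
    re.compile(r"\bwarning\b.*\bpermission\b", re.IGNORECASE),
    re.compile(r"\b(empty|no) output\b", re.IGNORECASE),
    re.compile(r"\b0 (errors?|failures?|failed)\b", re.IGNORECASE),
]


def classify(tool_output: str) -> Signal:
    """Classify line by line so one benign line cannot mask a real error."""
    saw_benign = False
    for line in tool_output.splitlines():
        if not ERROR_MARKERS.search(line):
            continue
        if any(p.search(line) for p in BENIGN_PATTERNS):
            saw_benign = True
            continue
        return Signal.ERROR
    return Signal.BENIGN if saw_benign else Signal.NONE


def error_rate(outputs) -> float:
    """Share of outputs classified as real errors."""
    if not outputs:
        return 0.0
    return sum(classify(o) is Signal.ERROR for o in outputs) / len(outputs)
```

Keeping `BENIGN` as its own category, rather than folding it into `NONE`, leaves a trail for auditing which allow-list rules are doing the filtering.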
8. Further Thoughts
The Two Routes Are Converging
SkillOpt + SkillOS already absorb RL ideas (learning rate, batch, gate, RL-trained curator). SKILL0 + Skill1 explore "which skills are worth internalizing."
I believe the final form will be: an RL-trained Curator manages the external skill library (selection, ranking, composition), while high-frequency high-value skills are selectively internalized into model parameters. The Executor only cares about the current task; the Curator provides the best skill combination for the context window. The Curator can be a small model + deterministic rules — it doesn't need a large model.
No System Does All Five Things
The complete capability set:
- Extract patterns from traces ✓ (Trajectory Miner / Trace2Skill)
- Verify quality improvements ✓ (Validation Gate / CoEvoSkills)
- Share across users (SkillClaw direction, not yet implemented)
- Optimize with controlled iteration ✓ (SkillOpt)
- Maintain skill library lifecycle ✓ (Curator / SkillOS)
We cover items 1, 2, 4, and 5 of this list. What's missing is item 3, cross-user sharing, where one user's hard-earned lesson automatically benefits others. This is SkillClaw's insight, and the most natural value proposition for SaaS agent products.
The Trend Is Irreversible
March-May 2026: eight papers, nearly simultaneously. Not a coincidence. The agent field is shifting from "prompt-engineering a single model" to building systems that accumulate experience. Like software engineering moving from procedural to object-oriented programming — except this time, the units aren't code, they're experience.
Our pipeline isn't the end. It just proves that you can automatically extract intelligence from chat logs. The next question isn't "can we?" but "how do we consume this intelligence to maximize the agent's long-term performance?"
We're still exploring that.
References
- Ni et al., Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills, arXiv:2603.25158, 2026.
- He et al., CoEvoSkills: Co-evolving AI Agents with Diverse Skill Packs, arXiv:2604.01687, 2026.
- Yang et al., SkillX: Evolving AI Agent Skill from Number to Knowledge Base, arXiv:2604.04804, 2026.
- Wu et al., SkillClaw: Multi-agent Skill Evolution from Collective Agent Trajectories, arXiv:2604.08377, 2026.
- Wang et al., SKILL0: Zero-Shot Agent with Internalized Skills, arXiv:2604.02268, 2026.
- Hu et al., SkillOpt: Optimizing Agent Skills through Textual Learning Rate, arXiv:2605.23904, 2026.
- An et al., SkillOS: On a Unified Skill Operating System for LLM Agents, arXiv:2605.06614, 2026.
- Guo et al., Skill-DisCo: Distilling and Compiling Agent Traces into Reusable Procedural Skills, arXiv:2606.26669, 2026.
- Vadim Nicolai, Why Do AI Agents Keep Making the Same Mistakes?, 2026. vadim.blog/trajectory-miner-research-to-practice