Agent Self-Evolution: From Half a Year of Practice to a Trajectory Mining Pipeline

A hands-on account of building a self-evolving agent system on Hermes Agent: from surveying 8 papers to shipping a 4-agent pipeline, from the symbol-vs-parameter debate to the Executor-Curator separation architecture, from 196 sessions of chat history to automated intelligence extraction.

1. The Problem

After running an agent system daily for half a year, one problem stands out above all:

Chat history keeps growing, but its value decays.

Our state.db holds 196 sessions with 29k messages. A single complex session might use hundreds of messages and dozens of tool calls — but unless we revisit it, that experience sleeps forever in the database. Our weekly session_prune was, in effect, destroying an unmined knowledge deposit.

The deeper issue: agents repeat the same mistakes. A git error today, the same git error next week, debugging from scratch each time. What's missing isn't remembering the answer (that's RAG's job) — it's distilling reusable behavioral patterns from experience. That is, skills.

This isn't unique to us. Between March and May 2026, a wave of papers appeared, all probing the same question:

How can an agent learn from its own operational traces and evolve?

The eight papers, from different institutions and nearly simultaneous, are Trace2Skill, CoEvoSkills, SkillX, SkillClaw, SkillOpt, SkillOS, SKILL0, and Skill1.


2. Two Routes: Symbol vs. Parameter

The eight papers fall cleanly into two camps.

Route A: Symbolic/Text Evolution

Skill = an external text file (SKILL.md). Readable, writable, auditable. LLMs evolve it by reading, writing, merging, and editing the text.

| Work | Skill Format | Evolution Method |
|---|---|---|
| Trace2Skill | Single skill.md | Offline batch → parallel patches → hierarchical merge |
| CoEvoSkills | Multi-file package | Online Generator ↔ Verifier adversarial co-evolution |
| SkillX | 3-tier KB | Offline construction + active exploration |
| SkillClaw | Text skill | Cross-user trajectory aggregation → continuous update |
| SkillOpt | skill.md | DL-style training (learning rate / batch / gate) |
| SkillOS | Single Markdown | RL-trained Curator manages lifecycle online |

Route B: Parameter Internalization

Skill = parameter knowledge within model weights. RL training internalizes skills toward zero-shot execution.

| Work | Internalization | Optimization Target |
|---|---|---|
| SKILL0 | Dynamic Curriculum: full context → gradual withdrawal | Zero-shot without skill injection |
| Skill1 | Signal decomposition: low-pass → selection, high-pass → distillation | Single policy does three things |

The Core Trade-off

| Dimension | Symbolic Route | Parameter Route |
|---|---|---|
| Auditability | ✅ Fully readable | ❌ Weights are black boxes |
| Cross-model migration | ✅ Copy & paste | ❌ Requires retraining |
| Inference token cost | ❌ 3-5K per injection | ✅ Zero-shot <0.5K |
| Maintenance complexity | Low (git) | High (RL pipeline) |
| Best for | Multi-model, multi-platform, auditable | Fixed model, high-frequency tasks |

Our choice: symbolic route short-term, hybrid long-term. Hermes switches between models (deepseek, gemini, claude…) — cross-model portability is non-negotiable. But SkillOS's RL-trained Curator inspired us: eventually, manage the symbolic skill library with a trained policy, and selectively internalize high-frequency skills.


3. Architectural Watershed: Executor-Curator Separation

SkillOS's Executor-Curator separation is the architectural watershed.

Core idea: The agent (Executor) and the skill manager (Curator) should be completely independent systems. The agent just does the work. The Curator handles skill creation, merging, pruning, and evaluation. The Curator has its own objective function (downstream task success rate), its own training loop, fully decoupled from the agent's language model.

SkillOS's striking result: a specialized 8B-parameter Curator outperforms Gemini 2.5 Pro used with zero-shot prompting. Small model + training > large model + generic prompt.

Looking back, our Hermes system had unconsciously implemented this:

  • Executor: Hermes Agent (runs tasks, calls tools, writes code)
  • Curator: Deterministic pipeline (compound-system reflection, duplicate skill merging, decay-check)
  • Peripheral systems: SkillOpt training, pattern-detector, validation-set

We just hadn't recognized the pattern. Our Curator was deterministic rules, not an RL policy. Naming the pattern unified our design framework going forward.
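The separation can be made concrete with a small sketch. This is not Hermes or SkillOS code: the `Trace`, `SkillLibrary`, and `RuleCurator` names and their fields are hypothetical, chosen only to show that the two sides share nothing except traces and the skill library, and that a deterministic rule set can sit where an RL policy would later go.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Trace:
    """One finished task as the Curator sees it (hypothetical shape)."""
    task: str
    succeeded: bool
    error: Optional[str] = None


@dataclass
class SkillLibrary:
    """The only state shared by both sides: skill name -> skill text."""
    skills: dict = field(default_factory=dict)


class Executor(Protocol):
    """Does the work; reads skills but never edits them."""
    def run(self, task: str, library: SkillLibrary) -> Trace: ...


class Curator(Protocol):
    """Owns the skill library; never runs tasks itself."""
    def curate(self, traces: list, library: SkillLibrary) -> None: ...


class RuleCurator:
    """Deterministic curator: an error seen in >= 2 traces becomes a skill."""

    def curate(self, traces: list, library: SkillLibrary) -> None:
        counts: dict = {}
        for t in traces:
            if not t.succeeded and t.error:
                counts[t.error] = counts.get(t.error, 0) + 1
        for error, n in counts.items():
            if n >= 2:
                library.skills[f"avoid:{error}"] = f"Seen {n} times: {error}"
```

Because `RuleCurator` only satisfies the `Curator` protocol, swapping it for a trained policy would not require touching the Executor side.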


4. Trajectory Mining Pipeline: From Chat Logs to Intelligence

With the Executor-Curator framework in place, how do we actually mine the traces? We designed a 4-agent pipeline based on Trace2Skill, Skill-DisCo, AutoRefine, and Vadim's Trajectory Miner.

Architecture

Trace Ingestor → Pattern Miner → Knowledge Distiller → Validation Curator
     │                │                  │                     │
 state.db       compound-system       fact_store           validation-set
                /bugs  /knowledge     +MEMORY.md

Four roles, fully separated:

  1. Trace Ingestor: Reads state.db, produces structured session summaries. No analysis, no judgment — just data cleaning.
  2. Pattern Miner: Cross-session pattern discovery — recurring errors (≥2 sessions), success patterns (≥2 same task type), source insights (error rate > 30%). Every pattern must pass self-questioning: root cause or symptom?
  3. Knowledge Distiller: Simple signals (e.g., "qqbot source has 67% error rate") → fact_store; complex patterns → compound-system/knowledge. Prioritized by AutoRefine's effectiveness × log(frequency) × precision.
  4. Validation Curator: Extracts validation tasks from successful sessions, deduplicates, maintains a 30-task cap.
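The Pattern Miner's two numeric rules can be sketched as follows. This is an illustration, not the contents of trajectory-miner.py: the summary dictionaries (`session_id`, `source`, `errors`) are an assumed shape for the Trace Ingestor's output, and "error rate" is assumed to mean the share of a source's sessions that contain at least one error.

```python
from collections import defaultdict

MIN_SESSIONS = 2          # the "≥2 sessions" rule from the text
SOURCE_ERROR_RATE = 0.30  # the "error rate > 30%" rule from the text


def recurring_errors(summaries):
    """Map each error signature to its session count, keeping only recurring ones."""
    sessions_by_error = defaultdict(set)
    for s in summaries:
        for err in s["errors"]:
            # A set ensures one session repeating an error counts once.
            sessions_by_error[err].add(s["session_id"])
    return {
        err: len(ids)
        for err, ids in sessions_by_error.items()
        if len(ids) >= MIN_SESSIONS
    }


def noisy_sources(summaries):
    """Map each source above the error-rate threshold to its error rate."""
    total = defaultdict(int)
    failed = defaultdict(int)
    for s in summaries:
        total[s["source"]] += 1
        if s["errors"]:
            failed[s["source"]] += 1
    rates = {src: failed[src] / total[src] for src in total}
    return {src: r for src, r in rates.items() if r > SOURCE_ERROR_RATE}
```

The self-questioning step ("root cause or symptom?") is a judgment call and is deliberately left out of this sketch.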

7 Design Decisions

Each decision traces back to a paper or a published write-up, not guesswork:

| # | Decision | Source | Effect |
|---|---|---|---|
| 1 | Parallel batch consolidation > sequential editing | Trace2Skill | 3 min vs 60 min, higher quality |
| 2 | Pure analyst role: produce intelligence, not code | Vadim | Auditable, reversible |
| 3 | ≥2 threshold + self-questioning | Vadim + AutoRefine | 13/62 false positives filtered round 1 |
| 4 | Asymmetric failure analysis: deep vs shallow | Trace2Skill | Richer metadata on failed sessions |
| 5 | Compile + verify closed loop | Skill-DisCo | 0% execution error (paper claim) |
| 6 | Maintenance-score-driven pruning | AutoRefine | Low-score patterns auto-degrade |
| 7 | Dual-form: simple→fact_store, complex→compound-system | AutoRefine | Signals allocated by density |
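Decisions 6 and 7 can be illustrated with the scoring formula quoted earlier, effectiveness × log(frequency) × precision. The sketch below makes several assumptions: the natural logarithm (the base is not stated), reuse of the same score for pruning, a hypothetical `kind` field that separates simple signals from complex patterns, and an arbitrary `min_score` cutoff.

```python
import math


def priority(effectiveness: float, frequency: int, precision: float) -> float:
    """effectiveness × log(frequency) × precision; frequency must be >= 1."""
    return effectiveness * math.log(frequency) * precision


def route(pattern: dict) -> str:
    """Dual-form routing: simple signals vs. complex patterns."""
    if pattern["kind"] == "signal":
        return "fact_store"
    return "compound-system/knowledge"


def keep_high_scoring(patterns: list, min_score: float) -> list:
    """Drop patterns scoring below min_score and return the rest, best first."""
    scored = [
        (priority(p["effectiveness"], p["frequency"], p["precision"]), p)
        for p in patterns
    ]
    kept = [(s, p) for s, p in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in kept]
```

One property worth noting: log(1) = 0, so a pattern seen only once scores zero under this formula, which lines up with the ≥2 threshold in decision 3.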

P1: Foundation (Harvest + Mine)

First version: ship the pipeline, worry about quality later.

trajectory-miner.py harvest mine write-solutions

Round 1 results:

| Metric | Value |
|---|---|
| Sessions analyzed | 189 |
| Real patterns found | 49 |
| False positives filtered | 13 |
| Written to compound-system bugs | 43 |
| Written to compound-system knowledge | 6 |

One interesting finding: every source had an error rate between 52% and 66%. That is too uniform to be a coincidence; more likely our detection threshold was too aggressive. This directly informed P2's false-positive improvements.

P2: Knowledge Distillation + Validation Curation

Two downstream consumers:

  1. Knowledge Distiller: Produces fact-staging.json (5 candidates with scores), memory-staging.md (5 MEMORY.md suggestions), 3 compound-system/knowledge aggregate entries (error ranking, success aggregation, task distribution).

  2. Validation Curator: Extracts tasks from successful sessions, deduplicates (15/27 already existed), rotates to maintain 30-task cap.
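The Validation Curator's dedupe-and-cap behaviour might look roughly like this. The text does not say how duplicates are detected or which tasks are rotated out, so two assumptions are made here: tasks are compared by whitespace- and case-normalized text, and the oldest tasks are dropped first. The `normalize` and `curate` names are hypothetical.

```python
MAX_TASKS = 30  # the cap stated in the text


def normalize(task: str) -> str:
    """Collapse whitespace and case so trivially different tasks compare equal."""
    return " ".join(task.lower().split())


def curate(existing: list, candidates: list, cap: int = MAX_TASKS) -> list:
    """Append unseen candidates, then keep only the newest `cap` tasks.

    `cap` must be a positive integer.
    """
    seen = {normalize(t) for t in existing}
    merged = list(existing)
    for task in candidates:
        key = normalize(task)
        if key not in seen:
            seen.add(key)
            merged.append(task)
    return merged[-cap:]
```

A production version would likely rotate by usefulness rather than age, but that requires the feedback loop the validation set does not yet have.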

Full pipeline: < 30 seconds.


5. Parallel Engineering Work

The pipeline was the main thread, but not the only one.

1. Memory System Refactor

19 fragmented scripts → 3 unified scripts (memory-core.py for maintenance, memory-sync.py for sync, memory-health.py for monitoring). 5 cron jobs instead of 11. Memory usage: 2450B → 1374B.

2. state.db Optimization

Two-layer cleanup: first an FTS5 rebuild to eliminate content duplication (351 MB → 200 MB), then deletion of tool messages from sessions more than 7 days old (200 MB → 134 MB). That is a 62% reduction overall (351 MB → 134 MB).
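The second cleanup layer and the size arithmetic can be sketched with Python's built-in sqlite3 module. The real state.db schema is not shown in this post, so the `sessions(id, started_at)` and `messages(session_id, role, content)` tables below are assumptions, as is storing timestamps as UTC ISO-8601 strings.

```python
import sqlite3
from datetime import datetime, timedelta


def prune_tool_messages(conn: sqlite3.Connection, now: datetime,
                        max_age_days: int = 7) -> int:
    """Delete role='tool' messages from sessions older than max_age_days.

    Returns the number of deleted rows. Timestamps are compared as ISO-8601
    strings, which only works if they all share one format and UTC offset.
    """
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM messages WHERE role = 'tool' AND session_id IN "
        "(SELECT id FROM sessions WHERE started_at < ?)",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


def relative_change(before_mb: float, after_mb: float) -> float:
    """Relative size change; 351 MB -> 134 MB gives about -0.618."""
    return (after_mb - before_mb) / before_mb
```

Deleting rows alone usually does not shrink a SQLite file on disk; a `VACUUM` afterwards is typically what returns the freed pages, unless auto-vacuum is enabled.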

Principle: raw chat logs are a temporary cache; permanent records live in fact_store + compound-system + MEMORY.md.

3. SkillOpt Integration

SkillOpt's optimized skills auto-ingest into compound-system and the skill library — a training → validation → deployment closed loop.

4. Validation Set Construction

Auto-built from session history: started at 15 tasks, expanded to 30 via Validation Curator, covering success and failure scenarios.

5. Pi Trace Ingestion

Pi coding agent session data is periodically ingested via a dedicated script and unified with Hermes traces.


6. Execution Schedule

The final automated chain:

Sun 01:00  Trajectory Pipeline (all 4 phases)
Sun 01:45  fact_store application (LLM-driven cron)
Sun 02:00  session_prune (safe deletion)

Daily:

  • 03:00 SkillOpt auto-evolution + compound-system refresh
  • 04:00 memory-core maintenance
  • 05:00 Pi trace ingestion
  • 06:00 Curator validation + Hermes config backup
  • 07:00 system-health + skill-rejuvenate
  • 08:00 memory-health + skillopt-ingest
  • 09:00 validation-set-refresh + skill-index-rebuild

22 weekly cron jobs, forming a self-sustaining evolution system. Problems get found, knowledge gets distilled, skills get updated — without human intervention.


7. Lessons Learned

What Worked

1. Research first, code second.

Reading 8 papers + 4 blog posts before writing a line of code saved us from blind implementation. Without Trace2Skill, we'd have built a sequential pipeline and rediscovered its 20x slowdown.

2. Pure analyst role.

The pipeline doesn't write skills, edit code, or make decisions — it only produces structured intelligence files. This felt constraining at first, but it's the reason the system stays clean: single responsibility, defined output formats, downstream consumers free to choose.

3. ≥2 threshold + self-questioning.

Single occurrence = incident. Two occurrences = pattern. Every pattern must answer "root cause or symptom?" This killed a lot of noise (13/62 signals filtered in round 1).

4. Ship the pipeline first, improve quality later.

P1 only did harvest + mine. Ugly filenames, imperfect entries. But it ran end-to-end. P2 added distillation and curation. Progressive delivery.

5. Pipeline runs before prune.

Extract value before deleting data. This ordering guarantees that state.db is always mined before it is pruned.

What Could Be Better

1. P1 title quality.

Bug filenames are raw JSON keys: 2026-07-05-auto-success___true___diff____---_a__path__n____b.md. The content has proper frontmatter, but the filenames hurt readability.

2. False positive threshold.

The detector currently flags everything containing error/fail/❌ as an error, producing error rates of 52-66% across all sources. Many of these hits are harmless tool permission warnings and empty outputs. It needs a finer-grained signal classifier.
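To make the gap concrete, here is a sketch that contrasts keyword-only detection with a slightly finer classifier. Neither function is the pipeline's actual detector: the `BENIGN_HINTS` pattern is a hypothetical placeholder, and a real list of benign signals would have to come from inspecting the logs.

```python
import re

# Keyword rule described in the text: anything with error/fail/❌ counts.
ERROR_MARKERS = re.compile(r"error|fail|❌", re.IGNORECASE)

# Hypothetical hint for harmless tool permission warnings.
BENIGN_HINTS = re.compile(r"permission", re.IGNORECASE)


def naive_is_error(output: str) -> bool:
    """Keyword-only detection: true if any error marker appears."""
    return bool(ERROR_MARKERS.search(output))


def classify(output: str) -> str:
    """Return one of: 'empty', 'ok', 'benign_warning', 'error'."""
    if not output.strip():
        return "empty"
    if not ERROR_MARKERS.search(output):
        return "ok"
    if BENIGN_HINTS.search(output):
        return "benign_warning"
    return "error"
```

Only the `error` class would feed the error-rate statistics; `empty` and `benign_warning` could be counted separately so that they remain visible without inflating the rate.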

3. Validation set lacks feedback loop.

Currently one-directional: add tasks. No "this validation passed/failed → adjust pipeline threshold" feedback.

4. fact-staging.json requires LLM to consume.

fact_store is an agent tool; a pure script can't write to it. The LLM cron bridge adds architectural complexity.


8. Further Thoughts

The Two Routes Are Converging

SkillOpt + SkillOS already absorb RL ideas (learning rate, batch, gate, RL-trained curator). SKILL0 + Skill1 explore "which skills are worth internalizing."

I believe the final form will be: an RL-trained Curator manages the external skill library (selection, ranking, composition), while high-frequency high-value skills are selectively internalized into model parameters. The Executor only cares about the current task; the Curator provides the best skill combination for the context window. The Curator can be a small model + deterministic rules — it doesn't need a large model.

No System Does All Five Things

The complete capability set:

  1. Extract patterns from traces ✓ (Trajectory Miner / Trace2Skill)
  2. Verify quality improvements ✓ (Validation Gate / CoEvoSkills)
  3. Share across users ✗ (SkillClaw direction, not yet implemented)
  4. Optimize with controlled iteration ✓ (SkillOpt)
  5. Maintain skill library lifecycle ✓ (Curator / SkillOS)

We cover 1, 2, 4, and 5. What's missing is 3, cross-user sharing, where one user's hard-earned lesson automatically benefits others. This is SkillClaw's insight, and the most natural value proposition for SaaS agent products.

The Trend Is Irreversible

March-May 2026: eight papers, nearly simultaneously. Not a coincidence. The agent field is shifting from "prompt-engineering a single model" to building systems that accumulate experience. Like software engineering moving from procedural to object-oriented programming — except this time, the units aren't code, they're experience.

Our pipeline isn't the end. It just proves that you can automatically extract intelligence from chat logs. The next question isn't "can we?" but "how do we consume this intelligence to maximize the agent's long-term performance?"

We're still exploring that.


References

  1. Ni et al., Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills, arXiv:2603.25158, 2026.
  2. He et al., CoEvoSkills: Co-evolving AI Agents with Diverse Skill Packs, arXiv:2604.01687, 2026.
  3. Yang et al., SkillX: Evolving AI Agent Skill from Number to Knowledge Base, arXiv:2604.04804, 2026.
  4. Wu et al., SkillClaw: Multi-agent Skill Evolution from Collective Agent Trajectories, arXiv:2604.08377, 2026.
  5. Wang et al., SKILL0: Zero-Shot Agent with Internalized Skills, arXiv:2604.02268, 2026.
  6. Hu et al., SkillOpt: Optimizing Agent Skills through Textual Learning Rate, arXiv:2605.23904, 2026.
  7. An et al., SkillOS: On a Unified Skill Operating System for LLM Agents, arXiv:2605.06614, 2026.
  8. Guo et al., Skill-DisCo: Distilling and Compiling Agent Traces into Reusable Procedural Skills, arXiv:2606.26669, 2026.
  9. Vadim Nicolai, Why Do AI Agents Keep Making the Same Mistakes?, 2026. vadim.blog/trajectory-miner-research-to-practice