After Reading Skill-MAS: How Far Is Your Skill System from 'Auto-Evolution'?

arXiv 2606.18837 tells us: the orchestration capability of a Meta-agent can be written as an auto-evolvable skill text. And we already have 56 skills, a compound-system, and several agent frameworks — how do we play this hand?

TL;DR

Skill-MAS's core insight: The act of organizing multiple agents itself can be written as a skill file (Meta-Skill), then iteratively improved through multi-trajectory execution + selective reflection + contrastive diagnosis.
This maps closely onto what we already have: skills + compound-system + Baby Harness is essentially a prototype of this architecture. What we lack is a systematic closed iteration loop.
Five things we can do right now: multi-trajectory execution, selective (uncertainty-aware) reflection, contrastive diagnosis, abstraction gate, multi-round best-policy retention.
Biggest caveat: The paper relies on ground-truth labels; we need to go the self-supervised route to run in production.

1. A Paper That Validates Our Intuition

I recently came across Skill-MAS (arXiv 2606.18837, HKUST-GZ + Ant Group), and my first reaction was: This theorizes what we've been doing.

For the past six months, we've been working on one thing: distilling the Agent's "way of doing things" into reusable skill texts (SKILL.md), then using compound-system to reflect and record lessons after each task. The natural extension: it is not only sub-agent execution skills that can evolve; the "meta-capability" of how a Meta-agent organizes collaboration can also be written as a skill and iterated automatically.

Skill-MAS answers exactly that. It abstracts the Meta-agent's orchestration capability into a three-module Meta-Skill:

Task Decomposition     → "How to break down tasks"
Agent Engineering      → "Who does what"
Workflow Orchestration → "How to connect the workflow"

Then repeatedly executes → reflects → rewrites this skill on a validation set. The key insight: it doesn't change model parameters — only the text.

This fits our "skills as code" philosophy. It also provides evidence for several hypotheses we believed but never validated ourselves:

Skills are transferable across different LLMs (significant improvement still holds when switching models on the same task)
Skills are transferable across different tasks (improvement is equally substantial when switching tasks with the same model)
Modular skill decomposition pinpoints faults more accurately (three modules → you know which one to fix when something goes wrong)

2. What We Already Have

Let's take stock of our assets and their correspondence to Skill-MAS:

Skill-MAS Component	Our Counterpart	Status	Gap
Meta-Skill (Orchestration Text)	`.hermes/skills/` SKILL.md	✅ 56 skills	Missing orchestration-specific module structure
Runtime MAS Generation	Baby Harness agent teams	✅ Multi-agent orchestration	Skills not used to guide orchestration strategy
Reflection / Knowledge Distillation	compound-system reflect	✅ Post-task reflection	Single-run, no contrast, no multi-trajectory perspective
Selective Reflection (prioritization)	None	❌	Every issue treated equally
Multi-Trajectory Execution (K=5 sampling)	Single execution	❌	Can't distinguish "vague rules" from "insufficient ability"
Hierarchical Contrastive Analysis	None	❌	No comparison between good and bad trajectories
Abstraction Constraint	None	❌	Skill modifications can become single-case patches
Multi-Round + Best-Policy Retention	Manual patch	❌	No going back once changed

Key finding: We have the raw materials (skills) and the reflection engine (compound-system), but we are missing the closed iteration loop that connects them. Skill-MAS provides a ready-made design for exactly this loop.

3. Five Designs to Absorb Immediately

3.1 Multi-Trajectory Execution: From Point Estimate to Distribution Estimate

Our current execution strategy is one-shot: each task runs once and either succeeds or fails. But a single execution says little about whether a skill is good; the outcome could just be random noise.

Skill-MAS samples K=5 runs for the same task and computes two statistics:

uncertainty = std of scores  → whether the rule is fuzzy
difficulty  = negative mean score → whether the task is genuinely hard

High uncertainty means the skill's rules aren't clear enough: the agent sometimes gets it right and sometimes wrong. High difficulty means the task truly requires stronger capability. Tasks where both are high are the most worthwhile optimization targets.

What we can do: In compound-system's validation step, automatically run boundary tasks 3 to 5 times and calculate the standard deviation of the scores. If the same prompt shows large variance across runs, that points to skill rules that need refinement rather than to insufficient capability.
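
The two statistics above can be sketched in a few lines. This is a minimal illustration, assuming each run yields a numeric score and that the population standard deviation is an acceptable estimator (this post does not say which one the paper uses). `trajectory_stats` is a hypothetical helper name, not an existing compound-system function.

```python
from statistics import mean, pstdev


def trajectory_stats(scores):
    """Return (uncertainty, difficulty) for the K scores of one task.

    Assumption: scores are numeric, higher is better.
    """
    if not scores:
        raise ValueError("need at least one score")
    uncertainty = pstdev(scores)  # spread across runs: are the rules fuzzy?
    difficulty = -mean(scores)    # negative mean: is the task genuinely hard?
    return uncertainty, difficulty


# Hypothetical scores from K=5 runs of the same task.
u, d = trajectory_stats([1.0, 0.0, 1.0, 0.0, 1.0])
print(f"uncertainty={u:.3f} difficulty={d:.3f}")
```

A task that always scores the same gets an uncertainty of zero regardless of whether it always passes or always fails, which is why difficulty is tracked as a separate signal.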

3.2 Selective Reflection: Spend Energy on What Matters

Our compound-system reflects on all tasks equally. But experience tells us: some failures deserve deep analysis, others are just noise.

Skill-MAS uses elbow truncation (second-order difference to find the natural cutoff point) to select a small number of high-priority tasks for deep reflective revision, rather than spreading effort evenly.

The formula is simple:

p_i = 0.5 * normalize(u_i) + 0.5 * normalize(d_i)

Sort by p_i, truncate at the "elbow" of the priority curve. Tasks past the elbow have diminishing value.
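
A rough sketch of the priority score and the elbow cut follows. Two things here are assumptions rather than details taken from the paper: `normalize` is implemented as min-max scaling, and the elbow is taken as the position with the largest second-order difference on the sorted priority curve. All function names are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros (assumed normalize)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priorities(uncertainty, difficulty):
    """p_i = 0.5 * normalize(u_i) + 0.5 * normalize(d_i)."""
    u, d = min_max(uncertainty), min_max(difficulty)
    return [0.5 * ui + 0.5 * di for ui, di in zip(u, d)]


def elbow_select(priority):
    """Return task indices ranked before the elbow of the sorted priority curve."""
    order = sorted(range(len(priority)), key=lambda i: priority[i], reverse=True)
    if len(order) < 3:
        return order  # too few points for a second-order difference
    curve = [priority[i] for i in order]
    # Second-order difference at each interior point of the descending curve.
    second_diff = [
        curve[i - 1] - 2 * curve[i] + curve[i + 1]
        for i in range(1, len(curve) - 1)
    ]
    # The largest value marks where the curve flattens out after a steep drop.
    elbow = 1 + max(range(len(second_diff)), key=lambda j: second_diff[j])
    return order[:elbow]


# Hypothetical priorities for six tasks; three stand out clearly.
selected = elbow_select([0.3, 1.0, 0.25, 0.9, 0.85, 0.2])
print(selected)
```

With these sample values the cut falls after the three high-priority tasks, so only those would be sent to deep reflection.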

What we can do: After gate-4 code review, compute the "uncertainty" (test instability) and "difficulty" (code complexity / issue reproducibility) of the change, and decide whether compound-system deep reflection is needed, or just a quick pass.

3.3 Contrastive Diagnosis: Let Good and Bad Trajectories Talk

compound-system's current reflection is narrative — "what went wrong this time." But Skill-MAS uses a much more powerful method: compare the execution trajectories of high-scoring and low-scoring groups.

For the same task, divide the K runs into two groups (above median vs below median), and ask the LLM to locate the divergence point: the decision step at which the trajectories start to differ.

Phase 1: Intra-task comparison
  High-score trajectory vs Low-score trajectory
  → Find divergence point
  → Extract success factors
  → Diagnose root cause of failure

Phase 2: Cross-task synthesis
  Merge findings from multiple tasks
  → Identify systematic patterns (good and bad)
  → Produce evidence package

What we can do: Change the entry point of compound-system reflection from "describe the problem" to "contrastive analysis." For repeated failures of the same type, first collect data, then do contrastive analysis, then write conclusions.
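
The data-collection half of Phase 1 could look like the sketch below. It assumes a trajectory is recorded as a score plus an ordered list of step labels, which is a hypothetical format. The `first_divergence` function is only a cheap string-equality pre-filter; in Skill-MAS the comparison itself is done by the LLM, so treat this as a way to prepare evidence for that prompt, not a replacement for it.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Trajectory:
    score: float
    steps: list[str]  # hypothetical: one label per decision step


def split_by_median(runs):
    """Split K runs of one task into (above-median, below-median) groups.

    Runs that score exactly the median are left out of both groups.
    """
    m = median(r.score for r in runs)
    high = [r for r in runs if r.score > m]
    low = [r for r in runs if r.score < m]
    return high, low


def first_divergence(a, b):
    """Index of the first step where two trajectories differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a.steps, b.steps)):
        if x != y:
            return i
    if len(a.steps) != len(b.steps):
        return min(len(a.steps), len(b.steps))  # one is a prefix of the other
    return None


runs = [
    Trajectory(0.9, ["plan", "search", "verify"]),
    Trajectory(0.2, ["plan", "answer"]),
    Trajectory(0.5, ["plan", "search", "answer"]),
]
high, low = split_by_median(runs)
print(first_divergence(high[0], low[0]))
```

If every run gets the same score, both groups come back empty, which is a useful signal in itself: there is nothing to contrast for that task.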

3.4 Abstraction Gate: No Single-Case Patches

One design constraint from the paper that I strongly agree with: each skill modification must be abstracted into a general orchestration principle; it cannot be written as a single-case patch.

This means: if the system fails on Task-A and you write the fix as "do X for Task-A," it won't pass review. You must write it as "when the task type is T, first do Y, then do Z," elevating the specific case to a principle.

What we can do: Add a check in the skill editing workflow (gate-3 skill update): does the newly written rule contain a specific task name or dataset name? If so, send it back for rewriting. Only abstract orchestration principles can enter the skill.
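
A simple version of that check might look like this. It assumes we maintain a list of known task and dataset names to screen against; the list contents and both function names are hypothetical. A name match is only a heuristic for "single-case patch", so a rule that passes still needs human or LLM review.

```python
import re


def find_single_case_references(rule_text, known_names):
    """Return the known task/dataset names that appear in a proposed skill rule."""
    hits = []
    for name in known_names:
        # Match the name as a whole token, case-insensitively.
        pattern = rf"(?<!\w){re.escape(name)}(?!\w)"
        if re.search(pattern, rule_text, flags=re.IGNORECASE):
            hits.append(name)
    return hits


def passes_abstraction_gate(rule_text, known_names):
    """True if the rule mentions no specific task or dataset name."""
    return not find_single_case_references(rule_text, known_names)


# Hypothetical names to screen against.
known = ["Task-A", "GSM8K"]
print(passes_abstraction_gate("Do X for Task-A.", known))
print(passes_abstraction_gate(
    "When the task needs multi-step arithmetic, verify intermediate results first.",
    known,
))
```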

3.5 Multi-Round Best-Policy Retention: It's Only Evolution If You Can't Break It

Skill-MAS retains the previous version of the skill after each iteration, and after R rounds selects the best-performing version on the validation set as the final version. This seems simple, but our current skill modifications are typically "one-way doors" — once changed, there's no going back.

What we can do: Add git branch management to the skill update workflow. Each auto-evolution creates a branch, retains all historical versions, and finally merges the best-performing version on the validation set back to main. This way, even if an evolution goes wrong, it won't contaminate already stable skills.
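
The selection logic itself is independent of git. The sketch below keeps versions in memory to show the retain-then-pick-best step; `rewrite` and `evaluate` are hypothetical callables standing in for the reflection step and the validation-set scorer, and in our workflow each stored version would correspond to a branch or commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillVersion:
    round_index: int
    text: str
    val_score: float


def evolve_with_retention(initial_text, rewrite, evaluate, rounds):
    """Run `rounds` rewrites, keep every version, return (best, history).

    rewrite(text) -> new skill text; evaluate(text) -> validation score.
    Ties go to the earliest version, so a rewrite must strictly improve to win.
    """
    history = [SkillVersion(0, initial_text, evaluate(initial_text))]
    for r in range(1, rounds + 1):
        candidate = rewrite(history[-1].text)
        history.append(SkillVersion(r, candidate, evaluate(candidate)))
    best = max(history, key=lambda v: v.val_score)
    return best, history


# Toy stand-ins: each rewrite appends "+", scores come from a fixed table.
toy_scores = {"s": 0.5, "s+": 0.7, "s++": 0.6, "s+++": 0.4}
best, history = evolve_with_retention(
    "s", lambda t: t + "+", toy_scores.__getitem__, rounds=3
)
print(best.round_index, best.val_score)
```

In the toy run the later rewrites make things worse, and the round-1 version is the one that would be merged back.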

4. A Limitation Worth Heeding

The paper candidly acknowledges a key limitation: Selection and reflection rely on ground-truth labels.

Specifically, computing the priority score requires knowing the ground-truth answer for each task, in order to calculate difficulty = -mean(score). In real production environments, most tasks don't have ground-truth labels.

The paper's mitigation is relatively weak ("future work can use LLM-as-judge"). We've already accumulated LLM-as-judge practice in our ai-evaluation skill. This path is viable, but needs careful design:

Key risk: self-fulfilling prophecy — using an LLM to judge an LLM, where the evaluation criteria may share the same blind spots as the model
Our response: Multi-dimensional scoring (correctness + efficiency + maintainability) + a small amount of human annotation for calibration
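
One way to make that response concrete is sketched below, under assumptions of my own: the three dimensions are scored in [0, 1] and weighted equally by default, and calibration is measured as the mean absolute gap between judge scores and the small human-annotated sample. Neither the weights nor the gap metric come from the paper or from our ai-evaluation skill.

```python
DIMENSIONS = ("correctness", "efficiency", "maintainability")


def combined_score(scores, weights=None):
    """Weighted average over the scoring dimensions (equal weights by default)."""
    weights = weights or {dim: 1.0 for dim in DIMENSIONS}
    total = sum(weights.values())
    return sum(weights[dim] * scores[dim] for dim in weights) / total


def calibration_gap(judge_scores, human_scores):
    """Mean absolute difference between judge and human scores on the same items."""
    pairs = list(zip(judge_scores, human_scores))
    if not pairs:
        raise ValueError("need at least one annotated item")
    return sum(abs(j - h) for j, h in pairs) / len(pairs)


# Hypothetical judge output for one task, and a two-item calibration sample.
print(combined_score({"correctness": 1.0, "efficiency": 0.5, "maintainability": 0.0}))
print(calibration_gap([0.8, 0.4], [1.0, 0.5]))
```

If the gap on the annotated sample grows over time, that is a cue to stop trusting the judge-derived difficulty scores until the rubric is revised.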

5. Our Next-Step Roadmap

Combining the insights from Skill-MAS with our existing assets, here's the direction I want to pursue:

Phase 1 (Do Now)
├── Multi-trajectory sampling: Add K-run mode to compound-system, output variance
└── Contrastive diagnosis: Change reflection from "describe the problem" to "compare good/bad trajectories"

Phase 2 (Short to Mid Term)
├── Selective reflection: Prioritize based on uncertainty + difficulty
├── Abstraction gate: Add rule abstraction check to skill editing
└── Iterative retention: Use git branches for skill updates, select best to merge

Phase 3 (Long Term)
├── Self-supervised evaluation: LLM-as-judge as a replacement for ground-truth
├── Cross-session Meta-Skill persistence
└── Skill update auto MR → review → merge pipeline

6. Final Thoughts

Skill-MAS validates an intuition we've held for a long time: The orchestration capability of an Agent can be written as text, evolved automatically, and transferred across models.

It's not another framework that requires GPUs and training data. Its core upgrade is at the understanding layer — making metacognition explicit (structured skill text) instead of implicit (model parameters / search algorithms). Only what's explicit can be inspected, modified, and reused.

And we happen to already have 56 skills, a compound-system, and a Baby Harness. The cards are already on the table — what's missing is a set of rules to play them.

If you're also building systems and practices around Agent skills, let's talk.

Reference: Lin, Yang, Qin. "Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems." arXiv:2606.18837, June 2026.