Golden Data-Driven Automatic Evaluation Pipeline
Automatic evaluation is the foundation of optimization: without it, every change is judged from a partial view, like the blind men describing an elephant by touch.
1. Golden Data-Driven Evaluation
data/eval/
├─ tickets_eval.csv # Evaluation tickets (query + metadata)
├─ golden_expectations.csv # Expected results (intent + risk + severity)
└─ golden_sag.csv # Expected retrieval results (query → relevant_doc_ids)
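A minimal sketch of how the three files above might be joined into one evaluation case per ticket. The column names (`ticket_id`, `query`, `intent`, `risk`, `severity`, `relevant_doc_ids`), the `;` separator, and the sample rows are assumptions made for illustration; the actual schemas are not specified here.

```python
import csv
import io

# Hypothetical sample contents; the real column names are not documented here.
TICKETS_CSV = """ticket_id,query,channel
T1,My card was charged twice,email
T2,How do I reset my password,chat
"""
EXPECTATIONS_CSV = """ticket_id,intent,risk,severity
T1,billing_dispute,chargeback;refund,HIGH
T2,account_access,,LOW
"""
SAG_CSV = """query,relevant_doc_ids
My card was charged twice,doc_12;doc_40
How do I reset my password,doc_03
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def split_ids(cell):
    """Split a ';'-separated cell into a list, dropping empty entries."""
    return [part for part in cell.split(";") if part]


def build_eval_cases(tickets_text, expectations_text, sag_text):
    """Join the three golden files into one record per labelled ticket."""
    expected = {row["ticket_id"]: row for row in read_rows(expectations_text)}
    relevant = {
        row["query"]: split_ids(row["relevant_doc_ids"])
        for row in read_rows(sag_text)
    }
    cases = []
    for ticket in read_rows(tickets_text):
        gold = expected.get(ticket["ticket_id"])
        if gold is None:
            continue  # tickets without a golden label cannot be scored
        cases.append({
            "ticket_id": ticket["ticket_id"],
            "query": ticket["query"],
            "intent": gold["intent"],
            "risk": split_ids(gold["risk"]),
            "severity": gold["severity"],
            "relevant_doc_ids": relevant.get(ticket["query"], []),
        })
    return cases


cases = build_eval_cases(TICKETS_CSV, EXPECTATIONS_CSV, SAG_CSV)
```

With real files, the `io.StringIO` wrapper would be replaced by `open(path, newline="")`; the join logic stays the same.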
2. Evaluation Dimensions
| Dimension | Metric | Description |
|---|---|---|
| Intent | accuracy, F1 per class | 8-class intent classification accuracy |
| Severity | accuracy, macro-F1 | Severity grading (LOW/MEDIUM/HIGH) |
| Risk | F1, precision, recall | 6-class risk label detection |
| SAG | recall@k, MRR | Retrieval recall and ranking quality for relevant documents |
| No-auto-send | accuracy | Identification of tickets that should not be auto-sent |
| Composite | weighted score | Weighted combination of the metrics above, in the range 0 to 1 |
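The metrics in the table can be sketched in plain Python as follows. The function names and the weights passed to `composite` are illustrative assumptions; the actual weighting used by the pipeline is not stated here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the golden label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def recall_at_k(relevant, ranked, k):
    """Share of the relevant documents that appear in the top k results."""
    wanted = set(relevant)
    if not wanted:
        return 0.0
    return len(wanted & set(ranked[:k])) / len(wanted)


def mrr(relevant_lists, ranked_lists):
    """Mean reciprocal rank of the first relevant document per query."""
    total = 0.0
    for relevant, ranked in zip(relevant_lists, ranked_lists):
        wanted = set(relevant)
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in wanted:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(ranked_lists)


def composite(scores, weights):
    """Weighted average of per-dimension scores; stays in [0, 1] when inputs do."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


severity_true = ["LOW", "HIGH", "MEDIUM", "HIGH"]
severity_pred = ["LOW", "HIGH", "HIGH", "HIGH"]
severity_acc = accuracy(severity_true, severity_pred)
severity_f1 = macro_f1(severity_true, severity_pred)
```

Macro-F1 treats every class equally, so a class that is never predicted correctly (MEDIUM in the sample above) pulls the score down even when overall accuracy looks acceptable.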
3. Evaluation → Optimization Loop
Golden Data
│
▼
Evaluate (intent + severity + risk + SAG)
│
▼
Diagnose (which cases failed? what is the pattern?)
│
├─→ Rule Fix (keyword/threshold adjustment)
├─→ ML Fix (FastText retraining)
├─→ LLM Fix (rule mutation suggestions)
│
▼
Verify (full regression test)
│
├─→ Pass → Apply
└─→ Fail → Rollback
│
▼
Record (evaluation results persisted to JSON)
│
└─→ Next optimization round (resume with --resume)
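The Verify, Apply or Rollback, and Record steps of the loop above could look roughly like the sketch below. Everything named here (`verify`, `run_round`, `toy_evaluate`, the history file name, and the JSON layout) is hypothetical; in particular, this does not reproduce the behavior of the pipeline's `--resume` option, it only shows how appending to a persisted history lets a later round continue where the previous one stopped.

```python
import json
import tempfile
from pathlib import Path


def verify(baseline, candidate, tolerance=0.0):
    """Return the metric names on which the candidate regresses."""
    return [
        name for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    ]


def run_round(current, proposed, evaluate, history_path):
    """Evaluate a proposed fix, keep it only if nothing regresses, and log it."""
    baseline = evaluate(current)
    candidate = evaluate(proposed)
    regressions = verify(baseline, candidate)
    applied = not regressions

    # Append to any existing history so round numbers continue across runs.
    history = json.loads(history_path.read_text()) if history_path.exists() else []
    history.append({
        "round": len(history) + 1,
        "applied": applied,
        "baseline": baseline,
        "candidate": candidate,
        "regressions": regressions,
    })
    history_path.write_text(json.dumps(history, indent=2))

    return proposed if applied else current  # rollback keeps the old config


def toy_evaluate(config):
    """Hypothetical stand-in for a full regression run over the golden data."""
    return {"intent_acc": config["intent_acc"], "severity_f1": config["severity_f1"]}


history_file = Path(tempfile.mkdtemp()) / "eval_history.json"
start = {"intent_acc": 0.80, "severity_f1": 0.70}
better = {"intent_acc": 0.85, "severity_f1": 0.70}
mixed = {"intent_acc": 0.90, "severity_f1": 0.60}

after_round_1 = run_round(start, better, toy_evaluate, history_file)
after_round_2 = run_round(after_round_1, mixed, toy_evaluate, history_file)
history = json.loads(history_file.read_text())
```

The second round improves one metric but regresses another, so the sketch rolls back; requiring no regression on any metric is one possible gating rule, and a composite-score threshold would be an equally plausible alternative.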
4. Full Data Flow
┌──────────────────────────────────────────────────────────────┐
│ User Request                                                 │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ SAG Retrieval (RAG)                                          │
│ ├─ Structural Recall (entity matching + expansion)           │
│ ├─ Vector Recall (embedding cosine)                          │
│ ├─ Fusion (both_boost)                                       │
│ └─ Thompson Sampling ranking                                 │
└──────────────────────┬───────────────────────────────────────┘
                       │ top-k skills
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ Skill Injection (SAG)                                        │
│ ├─ principle → behavioral guidance                           │
│ ├─ common_mistakes → negative guidance                       │
│ └─ when_to_apply → context matching                          │
└──────────────────────┬───────────────────────────────────────┘
                       │ augmented prompt
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ LLM Generation + Task Execution                              │
└──────────────────────┬───────────────────────────────────────┘
                       │ outcome
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ Feedback Loop                                                │
│ ├─ .usage.json (success_rate updates)                        │
│ ├─ failure_log.jsonl (failure records)                       │
│ ├─ compound.sh reflect (reflection judgment)                 │
│ └─ skill evolution (failure-driven evolution)                │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ Auto-Evaluation Pipeline                                     │
│ ├─ Golden data matching                                      │
│ ├─ Intent/Severity/Risk/SAG evaluation                       │
│ ├─ Diagnose → Fix → Verify → Apply                           │
│ └─ Optimizers (Rules → FastText → NSGA-II → LLM)             │
└──────────────────────────────────────────────────────────────┘
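As a rough illustration of the retrieval and feedback stages in the diagram, the sketch below fuses two recall lists, ranks the candidates with Thompson Sampling, and updates usage counts after an outcome. The details are assumptions, not the pipeline's actual logic: `both_boost` is modelled as an additive bonus for skills found by both recall paths, each skill's success rate is modelled with a Beta posterior over hypothetical `success` and `failure` counts, and the skill names are invented.

```python
import random


def fuse(structural, vector, both_boost=0.2):
    """Merge two score maps; skills found by both paths get a bonus."""
    fused = {}
    for skill in set(structural) | set(vector):
        score = max(structural.get(skill, 0.0), vector.get(skill, 0.0))
        if skill in structural and skill in vector:
            score += both_boost
        fused[skill] = score
    return fused


def thompson_rank(fused, usage, k, rng):
    """Rank skills by fused score times a sampled success probability."""
    sampled = {}
    for skill in sorted(fused):  # sorted for a reproducible draw order
        stats = usage.get(skill, {"success": 0, "failure": 0})
        # Beta(successes + 1, failures + 1): uniform prior, updated by outcomes.
        draw = rng.betavariate(stats["success"] + 1, stats["failure"] + 1)
        sampled[skill] = fused[skill] * draw
    return sorted(sampled, key=sampled.get, reverse=True)[:k]


def record_outcome(usage, skill, succeeded):
    """Update the counts for one skill and return its observed success rate."""
    stats = usage.setdefault(skill, {"success": 0, "failure": 0})
    stats["success" if succeeded else "failure"] += 1
    return stats["success"] / (stats["success"] + stats["failure"])


structural_hits = {"refund_policy": 0.5, "escalation": 0.9}
vector_hits = {"refund_policy": 0.6, "tone_guide": 0.4}
usage_counts = {"escalation": {"success": 8, "failure": 2}}

fused_scores = fuse(structural_hits, vector_hits)
top_skills = thompson_rank(fused_scores, usage_counts, k=2, rng=random.Random(7))
first_rate = record_outcome(usage_counts, "tone_guide", succeeded=True)
second_rate = record_outcome(usage_counts, "tone_guide", succeeded=False)
```

Because the ranking samples from each skill's posterior rather than using the mean success rate, rarely used skills still get selected occasionally, which is what lets the feedback loop gather evidence about them.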