Golden Data-Driven Automatic Evaluation Pipeline

Automatic evaluation is the foundation of optimization: without it, every change is a guess, like the blind men describing an elephant by touch.

1. Golden Data-Driven Evaluation

data/eval/
├─ tickets_eval.csv          # Evaluation tickets (query + metadata)
├─ golden_expectations.csv   # Expected results (intent + risk + severity)
└─ golden_sag.csv            # Expected retrieval results (query → relevant_doc_ids)
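
As a minimal sketch of how the evaluation tickets and their expected results might be paired up, the snippet below joins the two tables on a shared key. The column names (ticket_id, query, intent, risk, severity) and the helper names are assumptions made for illustration; the real CSV headers may differ. Inline samples stand in for the files so the sketch runs on its own.

```python
import csv
import io


def load_rows(handle):
    """Read a CSV file object into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle))


def join_golden(tickets, expectations, key="ticket_id"):
    """Pair each ticket with its golden row; collect tickets that have no label."""
    expected_by_id = {row[key]: row for row in expectations}
    paired, missing = [], []
    for ticket in tickets:
        golden = expected_by_id.get(ticket[key])
        if golden is None:
            missing.append(ticket[key])
        else:
            paired.append((ticket, golden))
    return paired, missing


# Hypothetical sample content; real files would be opened with open(path, newline="").
tickets_csv = io.StringIO(
    "ticket_id,query\nT1,Refund not received\nT2,App crashes on login\n"
)
golden_csv = io.StringIO("ticket_id,intent,risk,severity\nT1,refund,payment,HIGH\n")

paired, missing = join_golden(load_rows(tickets_csv), load_rows(golden_csv))
```

Reporting unlabeled tickets separately, rather than silently dropping them, keeps gaps in the golden set visible before any metric is computed.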

2. Evaluation Dimensions

Dimension	Metric	Description
Intent	accuracy, per-class F1	8-class intent classification
Severity	accuracy, macro-F1	Severity grading (LOW/MEDIUM/HIGH)
Risk	F1, precision, recall	6-class risk label detection
SAG	recall@k, MRR	Retrieval recall for relevant documents
No-auto-send	accuracy	Identification of tickets that should not be auto-sent
Composite	weighted score	Composite score (0-1)
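
The metrics in the table can be sketched in plain Python as follows. This is an illustrative version using the standard definitions of accuracy, macro-F1, recall@k, and MRR; the function names, the sample data, and the equal composite weights are assumptions, since the actual weighting is not specified here.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the golden label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr(runs):
    """Mean reciprocal rank of the first relevant hit over (ranked, relevant) pairs."""
    total = 0.0
    for ranked, relevant in runs:
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


def composite(scores, weights):
    """Weighted average of per-dimension scores, normalized to stay in 0-1."""
    return sum(weights[name] * scores[name] for name in weights) / sum(weights.values())


# Hypothetical sample data
y_true = ["LOW", "HIGH", "MEDIUM", "HIGH"]
y_pred = ["LOW", "HIGH", "HIGH", "HIGH"]
severity_acc = accuracy(y_true, y_pred)
severity_f1 = macro_f1(y_true, y_pred)
sag_recall = recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=2)
sag_mrr = mrr([(["d3", "d1", "d9"], {"d1", "d2"})])
overall = composite({"severity": severity_acc, "sag": sag_recall},
                    {"severity": 1.0, "sag": 1.0})
```

Dividing by the weight sum keeps the composite in the 0-1 range even when the weights do not add up to 1.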

3. Evaluation → Optimization Loop

Golden Data
    │
    ▼
Evaluate (intent + severity + risk + SAG)
    │
    ▼
Diagnose (which cases failed? what is the pattern?)
    │
    ├─→ Rule Fix (keyword/threshold adjustment)
    ├─→ ML Fix (FastText retraining)
    ├─→ LLM Fix (rule mutation suggestions)
    │
    ▼
Verify (full regression test)
    │
    ├─→ Pass → Apply
    └─→ Fail → Rollback
    │
    ▼
Record (evaluation results persisted to JSON)
    │
    └─→ Next optimization round (resume with --resume)
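
The loop above can be sketched for the simplest case, a keyword rule fix. Everything here is a hypothetical stand-in: the rule format, the pass criterion (the full-set score must not drop), the history file name, and the idea that resuming means appending to an existing JSON history are all assumptions for illustration, not the pipeline's actual behavior.

```python
import json
import os
import tempfile


def evaluate(rules, golden):
    """Fraction of golden cases where the first matching keyword rule gives the expected intent."""
    hits = 0
    for query, expected in golden:
        predicted = next(
            (intent for kw, intent in rules.items() if kw in query.lower()), "other"
        )
        hits += predicted == expected
    return hits / len(golden)


def run_round(rules, candidate, golden, history_path):
    """Evaluate, try a candidate fix, verify on the full set, then apply or roll back."""
    history = []
    if os.path.exists(history_path):  # resume from earlier rounds if a record exists
        with open(history_path) as f:
            history = json.load(f)
    baseline = evaluate(rules, golden)
    trial = {**rules, **candidate}
    score = evaluate(trial, golden)
    applied = score >= baseline  # assumed pass criterion: no regression on the full set
    history.append({"round": len(history) + 1, "baseline": baseline,
                    "score": score, "applied": applied})
    with open(history_path, "w") as f:
        json.dump(history, f, indent=2)
    return (trial if applied else rules), history


# Hypothetical golden cases and starting rules
golden = [("Where is my refund", "refund"),
          ("Cannot login to app", "login"),
          ("Refund please", "refund")]
rules = {"refund": "refund"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eval_history.json")
    rules, history = run_round(rules, {"login": "login"}, golden, path)     # helps
    rules, history = run_round(rules, {"refund": "billing"}, golden, path)  # regresses
```

The second candidate lowers the full-set score, so the rules are left unchanged while the failed attempt is still recorded in the history.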

4. Full Data Flow

┌──────────────────────────────────────────────────────────────┐
│                         User Request                         │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  SAG Retrieval (RAG)                                         │
│  ├─ Structural Recall (entity matching + expansion)          │
│  ├─ Vector Recall (embedding cosine)                         │
│  ├─ Fusion (both_boost)                                      │
│  └─ Thompson Sampling ranking                                │
└──────────────────────┬───────────────────────────────────────┘
                       │ top-k skills
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  Skill Injection (SAG)                                       │
│  ├─ principle → behavioral guidance                          │
│  ├─ common_mistakes → negative guidance                      │
│  └─ when_to_apply → context matching                         │
└──────────────────────┬───────────────────────────────────────┘
                       │ augmented prompt
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  LLM Generation + Task Execution                             │
└──────────────────────┬───────────────────────────────────────┘
                       │ outcome
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  Feedback Loop                                               │
│  ├─ .usage.json (success_rate updates)                       │
│  ├─ failure_log.jsonl (failure records)                      │
│  ├─ compound.sh reflect (reflection judgment)                │
│  └─ skill evolution (failure-driven evolution)               │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│  Auto-Evaluation Pipeline                                    │
│  ├─ Golden data matching                                     │
│  ├─ Intent/Severity/Risk/SAG evaluation                      │
│  ├─ Diagnose → Fix → Verify → Apply                          │
│  └─ Optimizers (Rules → FastText → NSGA-II → LLM)            │
└──────────────────────────────────────────────────────────────┘
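
To make the retrieval and feedback stages more concrete, here is a rough sketch of fusion followed by Thompson Sampling ranking. The details are assumptions: both_boost is modeled as a multiplicative bonus for skills found by both recall channels, each skill's success rate is drawn from a Beta posterior over its success and failure counts, and the sampled rate is multiplied into the fused score. The skill names, the boost value, and the in-memory usage map (standing in for .usage.json) are hypothetical.

```python
import random


def fuse(structural, vector, both_boost=1.2):
    """Merge two {skill: score} maps; skills found by both channels get a boost."""
    fused = {}
    for skill in set(structural) | set(vector):
        score = max(structural.get(skill, 0.0), vector.get(skill, 0.0))
        if skill in structural and skill in vector:
            score *= both_boost
        fused[skill] = score
    return fused


def thompson_rank(fused, usage, k, rng):
    """Rank skills by fused score times a success rate sampled from Beta(wins+1, losses+1)."""
    sampled = {}
    for skill, score in fused.items():
        wins, losses = usage.get(skill, (0, 0))
        sampled[skill] = score * rng.betavariate(wins + 1, losses + 1)
    return sorted(sampled, key=sampled.get, reverse=True)[:k]


def record_outcome(usage, skill, success):
    """Feedback step: update the success/failure counts after a task finishes."""
    wins, losses = usage.get(skill, (0, 0))
    usage[skill] = (wins + 1, losses) if success else (wins, losses + 1)


# Hypothetical recall results and usage history
structural = {"retry_policy": 0.8, "escalation": 0.6}
vector = {"retry_policy": 0.7, "refund_flow": 0.9}
usage = {"retry_policy": (9, 1), "refund_flow": (1, 9)}

fused = fuse(structural, vector)
top = thompson_rank(fused, usage, k=2, rng=random.Random(0))
record_outcome(usage, "escalation", success=True)
```

Sampling from the posterior, instead of ranking by the observed success rate directly, gives rarely used skills an occasional chance to be selected, so their statistics can improve over time.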