Agent RAG Is More Than Vector Search: Hybrid Retrieval Architecture in Practice

Skill retrieval is itself a RAG system: retrieved skills are used to enhance an agent's generation capabilities. This article breaks down the implementation and trade-offs of dual-path hybrid retrieval.

1. The Problem

Our skill system has 56 skills. Before each task begins, we need to find the most relevant ones in the skill library and load them into the prompt. This is essentially a RAG problem:

Query  = Current task description
Corpus = 56 SKILL.md files
Goal   = Recall the top 3-5 most relevant skills

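Most RAG tutorials only cover vector search. In production, though, pure vector search is insufficient: it blurs exact identifier matches (a query for code-review can land on code-style), and it offers no fallback when the embedding path fails.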

2. Dual-Path Retrieval Architecture

Instead of a pure vector approach, we use a hybrid scheme combining structural recall and vector recall:

Query
  │
  ├─→ Structural Recall
  │   ├─ Entity matching: query keywords vs skill provides/requires/tags
  │   ├─ Entity expansion: seed entity → cosine neighbors → candidate set
  │   └─ Scoring: base_score + overlap_weight × min(entity_overlap, cap)
  │
  ├─→ Vector Recall
  │   ├─ Query embedding (384d) vs all skill embeddings
  │   ├─ Cosine similarity > threshold → candidate set
  │   └─ Scoring: similarity × vector_weight
  │
  └─→ Fusion
      ├─ Both paths hit → structural + vector + both_boost
      ├─ Only structural → structural score
      ├─ Only vector → vector score
      └─ Sort by score → top-k
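
The two recall paths in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not the production code: the constant values, the function names, and the dictionary layout of a skill are all assumptions made for the example, and the entity-expansion step (seed entity → cosine neighbors) is left out for brevity.

```python
import math

# Illustrative constants; the article does not state the real values.
BASE_SCORE = 0.5
OVERLAP_WEIGHT = 0.1
OVERLAP_CAP = 3
VECTOR_WEIGHT = 1.0
SIM_THRESHOLD = 0.3


def structural_recall(query_terms, skills):
    """Score skills whose provides/requires/tags overlap the query terms."""
    scores = {}
    terms = set(query_terms)
    for name, skill in skills.items():
        entities = set(skill["provides"]) | set(skill["requires"]) | set(skill["tags"])
        overlap = len(terms & entities)
        if overlap > 0:
            # base_score + overlap_weight × min(entity_overlap, cap)
            scores[name] = BASE_SCORE + OVERLAP_WEIGHT * min(overlap, OVERLAP_CAP)
    return scores


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_recall(query_vec, skill_vecs):
    """Keep skills whose similarity exceeds the threshold, scaled by vector_weight."""
    scores = {}
    for name, vec in skill_vecs.items():
        sim = cosine(query_vec, vec)
        if sim > SIM_THRESHOLD:
            scores[name] = sim * VECTOR_WEIGHT
    return scores
```

The cap on entity overlap keeps a skill with many generic tags from outranking a skill with one precise match, which is presumably why the scoring formula includes it.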

3. Embedding Strategy

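from sentence_transformers import SentenceTransformer

# Model: paraphrase-multilingual-MiniLM-L12-v2 (384-dim, CPU-runnable)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Text construction: name + description + tags + triggers + body_preview
text = f"{skill['name']} {skill['description']} {' '.join(skill['tags'])} ..."
# normalize_embeddings=True yields unit-length vectors, so cosine similarity is a plain dot product
embedding = model.encode(text, normalize_embeddings=True)

# Fallback strategy: sentence-transformers unavailable → hash-based embedding
# MD5 bucketing + sign bit → 384-dim sparse vector → cosine still usable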

Why this model?

384 dimensions, CPU inference at 2ms/entry
Multilingual support (skill names are mixed Chinese and English)
Mature sentence-transformers ecosystem, pip install and go

The fallback strategy is essential for production — if the embedding model fails to load (OOM, version conflict), the system must not crash. Hash-based embedding has lower accuracy but ensures retrieval never breaks.
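
One way the hash-based fallback could look is sketched below. It is an assumption-laden illustration of "MD5 bucketing + sign bit": the whitespace tokenization, the choice of which digest bytes supply the bucket and the sign, and the `hash_embedding` / `embed` names are all invented for this example rather than taken from the real system.

```python
import hashlib
import math

DIM = 384  # same dimensionality as the primary model, so the index layout is unchanged


def hash_embedding(text, dim=DIM):
    """Feature-hashing embedding: MD5 picks one bucket and one sign per token."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim  # bucketing
        sign = 1.0 if digest[4] & 1 else -1.0             # sign bit
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # Normalize so cosine similarity behaves like it does for model embeddings.
    return [x / norm for x in vec] if norm else vec


def embed(text, model=None):
    """Use the loaded model if there is one; otherwise fall back to hashing."""
    if model is not None:
        return list(model.encode(text, normalize_embeddings=True))
    return hash_embedding(text)
```

Because the hash only sees surface tokens, two texts score as similar only when they share words. That is the accuracy loss mentioned above, traded for a path that has no model to load and therefore nothing to fail.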

4. Why "Hybrid" Instead of "Pure Vector"

Scenario	Pure Vector	Hybrid
"deploy server" → server-operations	✅ Semantic match	✅ + exact tag match
"code-review" → code-review	✅ But might match code-style	✅ Exact provides match
"debug error" → debugging-toolkit	✅	✅ + when_to_apply match
Typo "dubug" → debugging	❌ Embedding drifts	✅ Structural recall catches it

Conclusion: Structural recall handles exact matching and fault tolerance, while vector recall handles semantic generalization. The two complement each other.

5. Fusion Strategy

After dual-path retrieval, we need fused ranking. We experimented with three strategies:

Strategy	Effect	Issue
Weighted Average	Simple and controllable	Weights are hard to tune; one path's high score can overwhelm the other
RRF (Reciprocal Rank Fusion)	Rank-sensitive, robust	Ignores absolute score values
Conditional Fusion (currently in use)	Bonus for dual hits; single-path hits keep only their own score	Requires tuning the both_boost coefficient

We ultimately chose conditional fusion because it lets us push skills "hit by both paths" to the top — empirically, these tend to be the most relevant.
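
To make the comparison concrete, here is a small sketch of conditional fusion next to RRF. The `both_boost` value and the function names are assumptions for illustration; `k=60` is the constant commonly used for RRF, not a value from our system.

```python
BOTH_BOOST = 0.2  # assumed value; the real coefficient has to be tuned


def conditional_fusion(structural, vector, top_k=5, both_boost=BOTH_BOOST):
    """Fuse two {skill: score} maps, rewarding skills that both paths recalled."""
    fused = {}
    for name in set(structural) | set(vector):
        s = structural.get(name)
        v = vector.get(name)
        if s is not None and v is not None:
            fused[name] = s + v + both_boost  # both paths hit
        elif s is not None:
            fused[name] = s                   # only structural
        else:
            fused[name] = v                   # only vector
    # Sort by score (descending), breaking ties by name for stable output.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Note that `rrf` takes only ordered lists, so a skill recalled at similarity 0.95 and one recalled at 0.31 look identical if they hold the same rank. Conditional fusion keeps the absolute scores and adds the dual-hit bonus on top.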

6. Production Considerations

Caching: Precompute all skill embeddings at build time; only the query needs encoding at runtime
Latency: One retrieval ≈ 5ms (embedding) + 1ms (cosine) + 0.5ms (structural) ≈ 6.5ms in total
Updates: New/modified skills → incremental embedding index update, no full rebuild needed
Degradation chain: Normal → Degrade to pure vector → Degrade to keyword LIKE query → Degrade to random return
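
The incremental update and the degradation chain from the list above might be sketched as follows. Everything here is hypothetical scaffolding: the content-hash check, the `stages` list of `(label, retriever)` pairs, the substring match standing in for a SQL LIKE query, and the rule that an empty result also triggers the next stage are assumptions of this example.

```python
import hashlib
import random


def refresh_index(skills, index, embed):
    """Re-embed only skills whose text changed; drop skills that were removed."""
    for name in list(index):
        if name not in skills:
            del index[name]
    updated = []
    for name, text in skills.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = index.get(name)
        if cached is None or cached["digest"] != digest:
            index[name] = {"digest": digest, "vector": embed(text)}
            updated.append(name)
    return updated


def keyword_like(query, skill_names):
    """Substring match on skill names, a stand-in for a keyword LIKE query."""
    terms = query.lower().split()
    return [n for n in skill_names if any(t in n.lower() for t in terms)]


def retrieve_with_degradation(query, skill_names, stages, top_k=3):
    """Try each (label, retriever) in order; fall back to a random sample."""
    for label, retriever in stages:
        try:
            results = retriever(query)
        except Exception:
            continue  # this stage is broken, degrade to the next one
        if results:
            return label, list(results)[:top_k]
    fallback = random.sample(list(skill_names), min(top_k, len(skill_names)))
    return "random", fallback
```

Returning the stage label alongside the results makes it easy to log how often retrieval runs in a degraded mode, which is worth monitoring since each step down the chain lowers relevance.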

7. Key Differences from General RAG

In our case, what the "R" in RAG retrieves is not external documents but a skill library. This means:

What's retrieved is not "information" but "behavior patterns"
Relevance is not semantic similarity, but "can this skill handle the current task"
The impact of a wrong retrieval is not "answering off-topic" but "the Agent uses the wrong method"

Thus, retrieval quality is even more critical than in general RAG — recalling an irrelevant skill can steer the entire task in the wrong direction.

Conclusion: Hybrid retrieval + conditional fusion + three-stage degradation is the reliable solution we've validated in production.