Project Deep Dive

Knowledge Agent with Hybrid RAG

Phase 2 upgrades the research agent into a persistent knowledge system that stores evidence, retrieves relevant chunks, reranks context, verifies claims, and routes intelligently between local memory and live web search.

Tags: RAG · Dense Retrieval · BM25 · RRF · Cross-Encoder · Verification

TL;DR

Phase 1 could research, but it could not remember. Phase 2 introduces persistent knowledge and hybrid retrieval, turning one-shot web reasoning into a reusable memory system with improved routing accuracy and higher content reliability.

Tool Routing Accuracy: 100% (KB vs web routing decisions)
Content Hit Rate: 92% (relevant context retrieval coverage)
Overall Pass Rate: 92% (evaluation harness result)
Mean Relevance: -4.0948 (current scoring baseline)

The design goal was not just better answers, but better system behavior: reusable knowledge, explicit retrieval logic, and verifiable claims under noisy inputs.


Architecture Flow

Hybrid RAG Pipeline
User Query enters the orchestration layer
→ Routing Decision: knowledge-base path or web path (query_knowledge_base / web_search)
→ Dense Retrieval + BM25
→ Reciprocal Rank Fusion (RRF)
→ Cross-Encoder Reranking
→ verify_claim
→ Answer Synthesis
→ Optional save_to_knowledge_base

Core Mechanics

Store

Chunked ingestion persists useful evidence so future runs can reuse context instead of re-fetching everything from the web.

Retrieve + Rerank

Dense retrieval and BM25 broaden recall; RRF and cross-encoder reranking sharpen final relevance.

Verify + Generate

Claim checks reduce unsupported synthesis and force tighter alignment between generated responses and retrieved evidence.

Technical Deep Dive

Embeddings (Dense Retrieval)

Similarity is computed through cosine distance, enabling semantic retrieval beyond exact keyword overlap.

Cosine Similarity
cosine_similarity = (A · B) / (||A|| ||B||)
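The formula above is a direct translation into code. A minimal sketch, using toy lists in place of real model embeddings:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical directions score 1.0, orthogonal directions score 0.0; magnitude is normalized out, which is why scaled copies of the same vector still match perfectly.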

Hybrid Search (Dense + BM25)

RRF blends lexical and semantic retrieval so exact-match and meaning-match evidence can both survive the first cut.

RRF Score
score = 1/(rank_dense + 60) + 1/(rank_bm25 + 60)
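The fusion rule can be sketched over two ranked lists of doc ids; `rrf_fuse` is a hypothetical name, and ranks start at 1 so the scores match the formula above:

```python
def rrf_fuse(dense_ranking, bm25_ranking, k=60):
    """Merge two ranked doc-id lists with Reciprocal Rank Fusion (k = 60)."""
    scores = {}
    for ranking in (dense_ranking, bm25_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            # A doc found by both retrievers accumulates score from each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)
```

Because scores add across lists, a document that appears in both rankings outranks one that tops only a single list, which is exactly the "survive the first cut" behavior described above.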

Ingestion Pipeline

Paragraph-first splitting preserves semantic coherence, with fallback chunking when source structure is weak.
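A minimal sketch of that splitting strategy, assuming blank-line paragraph breaks and a hypothetical `max_chars` budget:

```python
def chunk_text(text, max_chars=800):
    """Paragraph-first splitting, with fixed-size windows as a fallback."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for para in paragraphs:
        if len(para) <= max_chars:
            chunks.append(para)  # paragraph fits: keep it whole
        else:
            # Weak structure: fall back to character windows.
            chunks.extend(para[i:i + max_chars]
                          for i in range(0, len(para), max_chars))
    return chunks
```

Keeping whole paragraphs where possible preserves semantic coherence; the window fallback only fires when a single paragraph exceeds the budget.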

Cross-Encoder Reranking

Query and candidate text are scored jointly, improving precision over independent bi-encoder scoring.
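The joint-scoring shape can be sketched with a toy scorer; in the real pipeline `score_fn` would be a trained cross-encoder model, and the token-overlap stand-in here is purely illustrative:

```python
def token_overlap(query, text):
    """Toy joint score: shared lowercase tokens (stand-in for a cross-encoder)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def rerank(query, candidates, score_fn=token_overlap, top_k=3):
    """Score each (query, candidate) pair jointly, keep the top_k candidates."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_k]
```

The key contrast with a bi-encoder is that the scorer sees query and candidate together, so it can weigh interactions between them rather than comparing two independently computed vectors.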

Dual-Tool Routing

A strong KB match uses local memory; a weak match falls back to live web search, preventing brittle local-only behavior.
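One way to express that fallback is a similarity threshold; this is a sketch only, since the actual routing described later is prompt-driven, and `route_query`, its signature, and the 0.65 threshold (borrowed from the verify_claim rule below) are assumptions:

```python
def route_query(query, kb_search, web_search, threshold=0.65):
    """Prefer the knowledge base on a strong match; otherwise hit the web."""
    hits = kb_search(query)  # assumed shape: list of (chunk, similarity), best first
    if hits and hits[0][1] >= threshold:
        return ("kb", hits[0][0])
    return ("web", web_search(query))
```

The empty-KB case also routes to the web, which is what makes the system usable before any knowledge has been saved.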

Concrete Retrieval Example

Query: "What is context poisoning in RAG?"
1. BM25 finds exact keyword-aligned chunk.
2. Dense retrieval surfaces semantically related chunk.
3. RRF merges both rankings into a unified candidate list.
4. Cross-encoder scores relevance jointly with query.
5. Highest-confidence chunk is selected for synthesis.

Claim Verification Layer

verify_claim Logic

Claim-to-evidence similarity threshold gates support status.

Support Rule
if cosine_similarity >= 0.65:
    status = "SUPPORTED"

Observed Behavior

Example: a claim/chunk similarity of 0.71 is accepted as supported, reducing the risk of weakly grounded responses.
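The gate above can be sketched end to end; the function name mirrors the tool, but the signature, the "UNSUPPORTED" label, and the raw-vector inputs are assumptions for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

SUPPORT_THRESHOLD = 0.65

def verify_claim(claim_vec, evidence_vecs):
    """A claim counts as supported when its best evidence similarity clears 0.65."""
    best = max(cosine(claim_vec, e) for e in evidence_vecs)
    status = "SUPPORTED" if best >= SUPPORT_THRESHOLD else "UNSUPPORTED"
    return status, best
```

Under this rule the observed 0.71 similarity clears the 0.65 gate, matching the behavior described above.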


Challenges, Failures, and Pivots

Routing Failures

Resolved by explicit prompt routing rules and clearer dispatch boundaries for KB vs web tools.

Context Poisoning

Mitigated through cross-encoder reranking to demote noisy high-recall chunks.

BM25 Invalidation Bug

Fixed by rebuilding BM25 index independently, eliminating stale lexical scoring artifacts.

Evaluation Mismatch

Resolved by sharing a consistent SYSTEM_PROMPT between runtime and evaluation harness.

Evaluation Results

Metric                | Result  | Interpretation
Tool Routing Accuracy | 100%    | Routing policy is robust under the tested prompts.
Content Hit Rate      | 92%     | Most required evidence was recovered in retrieval passes.
Overall Pass Rate     | 92%     | One remaining failure is tied to keyword-mismatch behavior.
Mean Relevance        | -4.0948 | Current relevance baseline, to be improved in future iterations.

Key Learnings

Retrieval quality dominates

Better model output depends first on better evidence selection, not only larger models.

Hybrid beats single retrieval

BM25 + dense search improves recall breadth before reranking precision kicks in.

Verification reduces hallucination risk

Claim gating forces synthesis to remain closer to available evidence.

Eval must mirror production

Prompt inconsistency between eval and runtime can hide real regressions.

Silent bugs are dangerous

The BM25 invalidation issue proved that retrieval pipelines can fail quietly.

Persistent memory changes behavior

Saved knowledge compounds agent performance over repeated domain queries.

Phase 3 Direction

Current Gaps

  • No self-correction execution loop yet.
  • No direct execution feedback cycle.
  • Cost tracking is incomplete.

Next Upgrade

  • LangGraph state-machine orchestration.
  • Execution → diagnose → patch loop.
  • Tighter runtime observability and control.

Appendix: Run Locally

Quickstart
python -m scripts.ingest
python -m scripts.evaluate
python main.py

Code references: src/knowledge_agent/tools.py, main.py

Repository: knowledge-agent on GitHub
