TL;DR
Phase 1 could research, but it could not remember. Phase 2 introduces persistent knowledge and hybrid retrieval, turning one-shot web reasoning into a reusable memory system with improved routing accuracy and higher content reliability.
The design goal was not just better answers, but better system behavior: reusable knowledge, explicit retrieval logic, and verifiable claims under noisy inputs.
Architecture Flow
query_knowledge_base / web_search → verify_claim → save_to_knowledge_base

Core Mechanics
Store
Chunked ingestion persists useful evidence so future runs can reuse context instead of re-fetching everything from the web.
Retrieve + Rerank
Dense retrieval and BM25 broaden recall; RRF and cross-encoder reranking sharpen final relevance.
Verify + Generate
Claim checks reduce unsupported synthesis and force tighter alignment between generated responses and retrieved evidence.
Technical Deep Dive
Embeddings (Dense Retrieval)
Similarity is computed as cosine similarity, enabling semantic retrieval beyond exact keyword overlap.
cosine_similarity = (A · B) / (||A|| ||B||)
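A minimal, dependency-free sketch of the formula above (the real pipeline would compute this over embedding vectors from the model):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 4))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 4))  # 0.0
```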
Hybrid Search (Dense + BM25)
RRF blends lexical and semantic retrieval so exact-match and meaning-match evidence can both survive the first cut.
score = 1/(rank_dense + 60) + 1/(rank_bm25 + 60)
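The fusion step can be sketched as follows; this assumes 1-indexed ranks and the standard RRF constant k = 60 from the formula above (function and variable names are illustrative, not the project's actual identifiers):

```python
def rrf_fuse(dense_ranking: list[str], bm25_ranking: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (rank + k)."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, bm25_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c2", "c1", "c3"]  # semantic ranking
bm25 = ["c1", "c4", "c2"]   # lexical ranking
print(rrf_fuse(dense, bm25))  # ['c1', 'c2', 'c4', 'c3']
```

Note that `c1` wins overall despite topping neither list: consistent mid-rank presence in both rankings beats a single first place, which is exactly why RRF lets both exact-match and meaning-match evidence survive the first cut.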
Ingestion Pipeline
Paragraph-first splitting preserves semantic coherence, with fallback chunking when source structure is weak.
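A simplified sketch of this splitting strategy, assuming blank-line paragraph boundaries and a character budget (the actual chunk size and fallback logic in the project may differ):

```python
def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Paragraph-first splitting; fall back to fixed-size windows when a
    paragraph exceeds the budget (i.e. when source structure is weak)."""
    chunks: list[str] = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)  # keep the paragraph intact
        else:
            # Fallback: hard-split oversized paragraphs into fixed windows.
            chunks.extend(para[i:i + max_chars] for i in range(0, len(para), max_chars))
    return chunks
```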
Cross-Encoder Reranking
Query and candidate text are scored jointly, improving precision over independent bi-encoder scoring.
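The reranking control flow can be sketched with an injected pair scorer; in the real pipeline `score_pair` would be a cross-encoder forward pass over the concatenated (query, candidate) text, while the token-overlap scorer below is only a stand-in so the example runs standalone:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Rerank candidates by jointly scoring each (query, candidate) pair."""
    return sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)[:top_k]

def overlap(q: str, c: str) -> float:
    # Toy joint scorer (stand-in for a cross-encoder): shared-token count.
    return len(set(q.lower().split()) & set(c.lower().split()))

docs = ["BM25 is a lexical ranking function",
        "Cross-encoders score query and document jointly",
        "Cats sleep most of the day"]
print(rerank("how do cross-encoders score a query", docs, overlap, top_k=2))
```

Because the scorer sees query and candidate together, it can model interactions a bi-encoder misses; the trade-off is that it must run once per candidate, which is why it is applied after the cheap recall stage, not instead of it.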
Dual-Tool Routing
Strong KB match uses local memory. Weak match falls back to live web search to prevent brittle local-only behavior.
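The dispatch boundary can be sketched as a single threshold check; the 0.60 cutoff and tool names here are illustrative assumptions, not the project's actual values:

```python
KB_MATCH_THRESHOLD = 0.60  # assumed value; tune against routing evals

def route(query: str, kb_best_score: float) -> str:
    """Route to local memory when KB retrieval is confident, else fall back to web."""
    if kb_best_score >= KB_MATCH_THRESHOLD:
        return "query_knowledge_base"
    return "web_search"

print(route("previously researched topic", 0.82))  # query_knowledge_base
print(route("novel topic", 0.31))                  # web_search
```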
Concrete Retrieval Example
Claim Verification Layer
verify_claim Logic
Claim-to-evidence similarity threshold gates support status.
if cosine_similarity >= 0.65:
    status = "SUPPORTED"
Observed Behavior
Example: a claim/chunk similarity of 0.71 clears the gate and is accepted as supported, reducing the risk of weakly grounded responses.
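Putting the gate together as a runnable sketch (the 0.65 threshold comes from the snippet above; the function name mirrors the tool but the body is a simplification):

```python
SUPPORT_THRESHOLD = 0.65

def verify_claim(claim_evidence_similarity: float) -> str:
    """Gate support status on claim-to-evidence cosine similarity."""
    return "SUPPORTED" if claim_evidence_similarity >= SUPPORT_THRESHOLD else "UNSUPPORTED"

# The observed run: a 0.71 claim/chunk similarity clears the 0.65 gate.
print(verify_claim(0.71))  # SUPPORTED
print(verify_claim(0.50))  # UNSUPPORTED
```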
Challenges, Failures, and Pivots
Routing Failures
Resolved by explicit prompt routing rules and clearer dispatch boundaries for KB vs web tools.
Context Poisoning
Mitigated through cross-encoder reranking to demote noisy high-recall chunks.
BM25 Invalidation Bug
Fixed by rebuilding BM25 index independently, eliminating stale lexical scoring artifacts.
Evaluation Mismatch
Resolved by sharing a consistent SYSTEM_PROMPT between runtime and evaluation harness.
Evaluation Results
| Metric | Result | Interpretation |
|---|---|---|
| Tool Routing Accuracy | 100% | Routing policy is currently robust under tested prompts. |
| Content Hit Rate | 92% | Most required evidence was recovered in retrieval passes. |
| Overall Pass Rate | 92% | One failure remains tied to keyword mismatch behavior. |
| Mean Relevance | -4.0948 | Current relevance baseline to improve in future iterations. |
Key Learnings
Retrieval quality dominates
Better model output depends first on better evidence selection, not only larger models.
Hybrid beats single retrieval
BM25 + dense search improves recall breadth before reranking precision kicks in.
Verification reduces hallucination risk
Claim gating forces synthesis to remain closer to available evidence.
Eval must mirror production
Prompt inconsistency between eval and runtime can hide real regressions.
Silent bugs are dangerous
The BM25 invalidation issue proved that retrieval pipelines can fail quietly.
Persistent memory changes behavior
Saved knowledge compounds agent performance over repeated domain queries.
Phase 3 Direction
Current Gaps
- No self-correction execution loop yet.
- No direct execution feedback cycle.
- Cost tracking is incomplete.
Next Upgrade
- LangGraph state-machine orchestration.
- Execution → diagnose → patch loop.
- Tighter runtime observability and control.
Appendix: Run Locally
python -m scripts.ingest
python -m scripts.evaluate
python main.py
Code references: src/knowledge_agent/tools.py, main.py
Repository: knowledge-agent on GitHub