TL;DR
Phase 1 could research, but it could not remember. Phase 2 introduces persistent knowledge and hybrid retrieval, turning one-shot web reasoning into a reusable memory system with improved routing accuracy and higher content reliability.
The design goal was not just better answers, but better system behavior: reusable knowledge, explicit retrieval logic, and verifiable claims under noisy inputs.
Architecture Flow
query_knowledge_base / web_search → verify_claim → save_to_knowledge_base

Core Mechanics
Store
Chunked ingestion persists useful evidence so future runs can reuse context instead of re-fetching everything from the web.
Retrieve + Rerank
Dense retrieval and BM25 broaden recall; RRF and cross-encoder reranking sharpen final relevance.
Verify + Generate
Claim checks reduce unsupported synthesis and force tighter alignment between generated responses and retrieved evidence.
Technical Deep Dive
Embeddings (Dense Retrieval)
Similarity is computed as cosine similarity, enabling semantic retrieval beyond exact keyword overlap.
cosine_similarity = (A · B) / (||A|| ||B||)
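A minimal, dependency-free sketch of the formula above (the real pipeline would compute this over embedding vectors from the model):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 4))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 4))  # 0.0
```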
Hybrid Search (Dense + BM25)
RRF blends lexical and semantic retrieval so exact-match and meaning-match evidence can both survive the first cut.
score = 1/(rank_dense + 60) + 1/(rank_bm25 + 60)
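The fusion step can be sketched as follows; this assumes 1-indexed ranks and the standard RRF constant k = 60 from the formula above (function and variable names are illustrative, not the project's actual identifiers):

```python
def rrf_fuse(dense_ranking: list[str], bm25_ranking: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (rank + k)."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, bm25_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c2", "c1", "c3"]  # semantic ranking
bm25 = ["c1", "c4", "c2"]   # lexical ranking
print(rrf_fuse(dense, bm25))  # ['c1', 'c2', 'c4', 'c3']
```

Note that `c1` wins overall despite topping neither list: consistent mid-rank presence in both rankings beats a single first place, which is exactly why RRF lets both exact-match and meaning-match evidence survive the first cut.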
Ingestion Pipeline
Paragraph-first splitting preserves semantic coherence, with fallback chunking when source structure is weak.
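A simplified sketch of this splitting strategy, assuming blank-line paragraph boundaries and a character budget (the actual chunk size and fallback logic in the project may differ):

```python
def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Paragraph-first splitting; fall back to fixed-size windows when a
    paragraph exceeds the budget (i.e. when source structure is weak)."""
    chunks: list[str] = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)  # keep the paragraph intact
        else:
            # Fallback: hard-split oversized paragraphs into fixed windows.
            chunks.extend(para[i:i + max_chars] for i in range(0, len(para), max_chars))
    return chunks
```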
Cross-Encoder Reranking
Query and candidate text are scored jointly, improving precision over independent bi-encoder scoring.
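The reranking control flow can be sketched with an injected pair scorer; in the real pipeline `score_pair` would be a cross-encoder forward pass over the concatenated (query, candidate) text, while the token-overlap scorer below is only a stand-in so the example runs standalone:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Rerank candidates by jointly scoring each (query, candidate) pair."""
    return sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)[:top_k]

def overlap(q: str, c: str) -> float:
    # Toy joint scorer (stand-in for a cross-encoder): shared-token count.
    return len(set(q.lower().split()) & set(c.lower().split()))

docs = ["BM25 is a lexical ranking function",
        "Cross-encoders score query and document jointly",
        "Cats sleep most of the day"]
print(rerank("how do cross-encoders score a query", docs, overlap, top_k=2))
```

Because the scorer sees query and candidate together, it can model interactions a bi-encoder misses; the trade-off is that it must run once per candidate, which is why it is applied after the cheap recall stage, not instead of it.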
Dual-Tool Routing
Strong KB match uses local memory. Weak match falls back to live web search to prevent brittle local-only behavior.
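The dispatch boundary can be sketched as a single threshold check; the 0.60 cutoff and tool names here are illustrative assumptions, not the project's actual values:

```python
KB_MATCH_THRESHOLD = 0.60  # assumed value; tune against routing evals

def route(query: str, kb_best_score: float) -> str:
    """Route to local memory when KB retrieval is confident, else fall back to web."""
    if kb_best_score >= KB_MATCH_THRESHOLD:
        return "query_knowledge_base"
    return "web_search"

print(route("previously researched topic", 0.82))  # query_knowledge_base
print(route("novel topic", 0.31))                  # web_search
```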
Concrete Retrieval Example
Claim Verification Layer
verify_claim Logic
Claim-to-evidence similarity threshold gates support status.
if cosine_similarity >= 0.65:
    status = "SUPPORTED"
Observed Behavior
Example: a claim/chunk similarity of 0.71 clears the gate and is accepted as supported, reducing the risk of weakly grounded responses.
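Putting the gate together as a runnable sketch (the 0.65 threshold comes from the snippet above; the function name mirrors the tool but the body is a simplification):

```python
SUPPORT_THRESHOLD = 0.65

def verify_claim(claim_evidence_similarity: float) -> str:
    """Gate support status on claim-to-evidence cosine similarity."""
    return "SUPPORTED" if claim_evidence_similarity >= SUPPORT_THRESHOLD else "UNSUPPORTED"

# The observed run: a 0.71 claim/chunk similarity clears the 0.65 gate.
print(verify_claim(0.71))  # SUPPORTED
print(verify_claim(0.50))  # UNSUPPORTED
```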
Challenges, Failures, and Pivots
Routing Failures
Resolved by explicit prompt routing rules and clearer dispatch boundaries for KB vs web tools.
Context Poisoning
Mitigated through cross-encoder reranking to demote noisy high-recall chunks.
BM25 Invalidation Bug
Fixed by rebuilding BM25 index independently, eliminating stale lexical scoring artifacts.
Evaluation Mismatch
Resolved by sharing a consistent SYSTEM_PROMPT between runtime and evaluation harness.
Evaluation Results
| Metric | Result | Interpretation |
|---|---|---|
| Tool Routing Accuracy | 100% | Routing policy is currently robust under tested prompts. |
| Content Hit Rate | 92% | Most required evidence was recovered in retrieval passes. |
| Overall Pass Rate | 92% | One failure remains tied to keyword mismatch behavior. |
| Mean Relevance | -4.0948 | Current relevance baseline to improve in future iterations. |
Key Learnings
Retrieval quality dominates
Better model output depends first on better evidence selection, not only larger models.
Hybrid beats single retrieval
BM25 + dense search improves recall breadth before reranking precision kicks in.
Verification reduces hallucination risk
Claim gating forces synthesis to remain closer to available evidence.
Eval must mirror production
Prompt inconsistency between eval and runtime can hide real regressions.
Silent bugs are dangerous
The BM25 invalidation issue proved that retrieval pipelines can fail quietly.
Persistent memory changes behavior
Saved knowledge compounds agent performance over repeated domain queries.
Phase 3 Direction
Current Gaps
- No self-correction execution loop yet.
- No direct execution feedback cycle.
- Cost tracking is incomplete.
Next Upgrade
- LangGraph state-machine orchestration.
- Execution → diagnose → patch loop.
- Tighter runtime observability and control.
Appendix: Run Locally
python -m scripts.ingest
python -m scripts.evaluate
python main.py
Code references: src/knowledge_agent/tools.py, main.py
Repository: knowledge-agent on GitHub