TL;DR
Phase 1 proves that an agent is not magic: it is state + loop + tools. The model plans the next action, your loop executes it, and results are fed back into the message state until write_report completes the run.
The core question behind this build was simple: how do you turn a stateless LLM endpoint into a system that can act, observe outcomes, and iterate?
The answer is mechanical, not mystical. The model has reasoning ability, but your runtime owns control flow, state integrity, tool safety, and termination conditions.
End-to-End Pipeline
1. main.py sends the query, tool schema, and conversation state.
2. web_search runs via Tavily; top results + snippets enter state for model review.
3. fetch_page retrieves selected sources in parallel; asyncio.gather + HTML cleanup reduce noise before synthesis.
4. calculate executes numeric derivations safely; AST filtering blocks dangerous expression execution.
5. write_report emits a structured markdown report; the run ends with a timestamped artifact in reports/.
Architecture Layers
main.py
Owns state and loop control: message ordering, API invocation, tool-call detection, stop conditions, and iteration limits.
tools.py
Owns tool execution: web search, page fetch, safe math evaluation, and markdown report writing through a dispatcher.
config.py
Owns environment and client setup to keep credentials and model wiring out of business logic.
Core Mechanics
State = Messages Array
The API is stateless. Every decision depends on what you append and in what order, including tool outputs and assistant turns.
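Concretely, the state is nothing more than an ordered list of role-tagged messages. A hypothetical mid-run snapshot (content and IDs illustrative, not a real transcript) might look like:

```python
# A minimal sketch of agent state: an ordered list of role-tagged messages.
messages = [
    {"role": "system", "content": "You are a research agent with tools."},
    {"role": "user", "content": "Compare Rust and Go adoption in 2024."},
    # Assistant turn that requested a tool (shown schematically):
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "web_search",
                                  "arguments": '{"query": "Rust vs Go adoption 2024"}'}}]},
    # The tool result must follow that assistant turn and echo its tool_call_id:
    {"role": "tool", "tool_call_id": "call_1", "content": "...search results..."},
]

# Ordering invariant: each tool message answers the assistant turn before it.
assert messages[3]["tool_call_id"] == messages[2]["tool_calls"][0]["id"]
```

Break that ordering, or drop the `tool_call_id` echo, and the next API call fails regardless of how good the model is.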
Loop = ReAct Control
Each iteration calls the model, executes requested tools, injects results, then calls again until no more tools are requested.
Tools = External Action
Without tools, the model can only infer. With tools, it can gather data, compute values, and persist structured outputs.
Assumptions
- Model emits valid tool-call JSON.
- Target pages are accessible and parseable.
- Context window can hold fetched evidence.
Failure Modes
- Tool-call mismatch triggers API errors.
- JS-heavy pages return thin content.
- Long runs cause context drift and goal loss.
Loop Skeleton (~30 Lines)
```python
import json

# Assumes client, MAX_ROUNDS, TOOLS, messages, and execute_tool are defined.
for _ in range(MAX_ROUNDS):
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        tools=TOOLS,
    )
    msg = response.choices[0].message
    if not msg.tool_calls:
        return msg.content          # no tool requested: the run is done
    messages.append(msg)            # assistant turn must precede its tool replies
    for tc in msg.tool_calls:
        result = execute_tool(
            tc.function.name,
            json.loads(tc.function.arguments),
        )
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,  # must echo the id from the assistant turn
            "content": result,
        })
```
This loop is the system. If message sequencing breaks, the agent breaks, even if the model is strong.
Tooling Decisions
web_search (Tavily)
Retrieves ranked candidates quickly so the model can choose which sources deserve full-page fetch.
fetch_page (async)
Uses parallel fetch + HTML cleanup (nav, script, style removal) to maximize useful context density.
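The same idea can be sketched with only the standard library: blocking fetches pushed onto worker threads and gathered, plus an HTMLParser-based cleanup that drops low-signal containers. The real tools.py may use different HTTP and parsing libraries; this is an assumption-laden minimal version.

```python
import asyncio
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping low-signal containers (nav, script, style)."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

def fetch_one(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return clean_html(resp.read().decode("utf-8", errors="replace"))

async def fetch_pages(urls: list[str]) -> list[str]:
    # urllib is blocking, so run each fetch in a worker thread and gather.
    return await asyncio.gather(*(asyncio.to_thread(fetch_one, u) for u in urls))
```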
calculate (AST safe eval)
Permits arithmetic while blocking unsafe execution vectors that a naive eval() approach would expose.
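A minimal sketch of the AST approach, assuming a whitelist of operators and numeric literals only; anything else (attribute access, calls, names) is rejected before evaluation:

```python
import ast
import operator

# Whitelisted operators; everything else in the AST is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
    ast.USub: operator.neg,
}

def safe_calculate(expr: str) -> float:
    """Evaluate arithmetic by walking the AST instead of calling eval().

    Only numeric constants and whitelisted operators pass, which blocks the
    injection vectors of a bare eval() (e.g. __import__, attribute chains).
    """
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("disallowed expression")

    return walk(ast.parse(expr, mode="eval"))
```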
write_report
Forces structured deliverables: title, summary, key points, and source list, instead of loose response text.
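A sketch of what such a writer can look like; the section names follow the fields above, while the filename pattern and function signature are assumptions, not the repository's exact code:

```python
from datetime import datetime
from pathlib import Path

def write_report(title: str, summary: str, key_points: list[str],
                 sources: list[str], out_dir: str = "reports") -> Path:
    """Render a structured markdown report and persist a timestamped artifact."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    body = "\n".join([
        f"# {title}",
        "",
        "## Summary",
        summary,
        "",
        "## Key Points",
        *[f"- {p}" for p in key_points],
        "",
        "## Sources",
        *[f"- {s}" for s in sources],
    ])
    path = Path(out_dir) / f"report-{stamp}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path
```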
Streaming
Makes reasoning visible in real time; critical for operator trust and live debugging in terminal workflows.
Schema Prompting
Tool descriptions act as behavior contracts. Better schema language improves tool selection consistency.
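As an illustration of a schema acting as a behavior contract, here is what an OpenAI-style function schema for calculate might look like (the wording is illustrative, not the repository's actual schema):

```python
CALCULATE_TOOL = {
    "type": "function",
    "function": {
        "name": "calculate",
        # The description doubles as a contract: when to pick this tool
        # and what it must NOT be used for.
        "description": (
            "Evaluate a pure arithmetic expression (numbers and + - * / ** % only). "
            "Use this for any numeric derivation instead of doing mental math. "
            "Do not pass variables, function calls, or natural language."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Arithmetic expression, e.g. '(17.4 - 3) / 2'",
                },
            },
            "required": ["expression"],
        },
    },
}
```

Tightening the "do not" clauses in descriptions like this is often the cheapest way to fix inconsistent tool selection.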
Challenges, Pivots, and Limits
Blocked / JS-rendered Pages
Raw HTTP fetch cannot execute JavaScript, so some pages degrade into low-signal context. Phase 2 targets this with stronger retrieval strategy.
Context Window Pressure
Large fetched payloads reduce attention on early turns, especially original goals. This causes drift in longer research runs.
No Native Verification
The model can cite sources, but citation is provenance, not truth validation. Cross-source verification is a future upgrade.
Tool Fragility
API ordering rules are strict. Missing assistant/tool pairing or mismatched tool_call_id results in immediate hard failure.
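A small invariant check can surface mis-sequenced state before the API rejects it. A sketch, assuming plain-dict messages (SDK message objects would need attribute access instead):

```python
def check_tool_pairing(messages: list[dict]) -> list[str]:
    """Return a list of sequencing violations in the message state.

    Invariant: every 'tool' message must follow an assistant turn and echo
    one of that turn's tool_call_ids.
    """
    problems = []
    pending_ids: set[str] = set()
    for i, m in enumerate(messages):
        role = m.get("role")
        if role == "assistant":
            pending_ids = {tc["id"] for tc in m.get("tool_calls") or []}
        elif role == "tool":
            if m.get("tool_call_id") not in pending_ids:
                problems.append(f"message {i}: unmatched tool_call_id {m.get('tool_call_id')!r}")
        else:
            pending_ids = set()
    return problems
```

Running a check like this on every loop iteration turns a cryptic 400 response into a pointed local error.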
Single-Pass Research
Phase 1 does one search + one fetch wave. Poor initial retrieval quality directly lowers final report quality.
No Caching
Repeat questions rerun the full pipeline. Latency and token cost remain high for repeated investigative themes.
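A future fix could be as simple as an on-disk cache keyed by a normalized query hash. A sketch only (a production version would need TTL expiry, since search results go stale):

```python
import hashlib
import json
from pathlib import Path

class SearchCache:
    """Tiny on-disk cache keyed by a normalized query hash."""

    def __init__(self, cache_dir: str = ".cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, query: str) -> Path:
        # Normalize so trivially different phrasings share an entry.
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()[:16]
        return self.dir / f"{key}.json"

    def get(self, query: str):
        p = self._path(query)
        return json.loads(p.read_text()) if p.exists() else None

    def put(self, query: str, results) -> None:
        self._path(query).write_text(json.dumps(results))
```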
Evaluation Snapshot
| Metric | Result | Interpretation |
|---|---|---|
| Tool Execution Reliability | High | Loop+dispatcher architecture is stable in normal runs. |
| Report Quality | Moderate | Quality rises/falls with source accessibility and content relevance. |
| Latency | ~40–70 seconds | Dominated by model round trips, not local compute. |
| Primary Failure Modes | Blocked pages, context drift | Both map directly to retrieval and memory limits. |
Key Learnings
Agents are loops, not sentience
Remove the loop and the "agent" collapses into a single completion call.
Schema quality drives behavior
Precise tool descriptions work as practical prompt control.
Context is the bottleneck
Long payloads can bury initial task intent unless summarized or retrieved selectively.
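One selective-retention tactic is to clip older tool payloads to a character budget so the system prompt and original question stay in view. A sketch under the same plain-dict message assumption:

```python
def trim_tool_payloads(messages: list[dict], keep_last: int = 2,
                       budget: int = 500) -> list[dict]:
    """Truncate the content of all but the most recent tool messages.

    Older fetched payloads rarely need to stay verbatim; clipping them keeps
    early turns (system prompt, original goal) inside the model's attention.
    """
    tool_indexes = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    to_trim = set(tool_indexes[:-keep_last]) if keep_last else set(tool_indexes)
    out = []
    for i, m in enumerate(messages):
        if i in to_trim and len(m.get("content", "")) > budget:
            m = {**m, "content": m["content"][:budget] + "\n[truncated]"}
        out.append(m)
    return out
```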
External data is noisy
Grounding requires source quality management, not just source collection.
Safety must be explicit
AST constraints and strict dispatcher logic are required guardrails, not optional polish.
Debugging starts with state
Inspecting message sequence explains most incorrect tool behavior faster than output-only inspection.
Phase 1 to Phase 2
Phase 1 Limits
- No persistent memory.
- No retrieval grounding layer.
- No robust source verification loop.
Phase 2 Direction
- Knowledge base + retrieval (RAG).
- Stronger source grounding and fallback behavior.
- Better context discipline under long evidence chains.
Appendix: Run Locally
```shell
git clone https://github.com/anudeepreddy332/cli-research-agent.git
cd cli-research-agent
uv sync
uv run python main.py
```
Code references: main.py, src/research_agent/tools.py, src/research_agent/config.py
Repository: CLI Research Agent on GitHub