TL;DR
Phase 1 proves that an agent is not magic: it is state + loop + tools. The model plans the next action, your loop executes it, and results are fed back into the message state until write_report completes the run.
The core question behind this build was simple: how do you turn a stateless LLM endpoint into a system that can act, observe outcomes, and iterate?
The answer is mechanical, not mystical. The model has reasoning ability, but your runtime owns control flow, state integrity, tool safety, and termination conditions.
End-to-End Pipeline
1. main.py sends the query, tool schema, and conversation state.
2. web_search runs via Tavily; top results + snippets enter state for model review.
3. fetch_page retrieves selected sources in parallel; asyncio.gather + HTML cleanup reduce noise before synthesis.
4. calculate executes numeric derivations safely; AST filtering blocks dangerous expression execution.
5. write_report emits a structured markdown report; the run ends with a timestamped artifact in reports/.
Architecture Layers
main.py
Owns state and loop control: message ordering, API invocation, tool-call detection, stop conditions, and iteration limits.
tools.py
Owns tool execution: web search, page fetch, safe math evaluation, and markdown report writing through a dispatcher.
config.py
Owns environment and client setup to keep credentials and model wiring out of business logic.
Core Mechanics
State = Messages Array
The API is stateless. Every decision depends on what you append and in what order, including tool outputs and assistant turns.
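Concretely, the state is nothing more than an ordered list of role-tagged messages. A hypothetical mid-run snapshot (content and IDs illustrative, not a real transcript) might look like:

```python
# A minimal sketch of agent state: an ordered list of role-tagged messages.
messages = [
    {"role": "system", "content": "You are a research agent with tools."},
    {"role": "user", "content": "Compare Rust and Go adoption in 2024."},
    # Assistant turn that requested a tool (shown schematically):
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "web_search",
                                  "arguments": '{"query": "Rust vs Go adoption 2024"}'}}]},
    # The tool result must follow that assistant turn and echo its tool_call_id:
    {"role": "tool", "tool_call_id": "call_1", "content": "...search results..."},
]

# Ordering invariant: each tool message answers the assistant turn before it.
assert messages[3]["tool_call_id"] == messages[2]["tool_calls"][0]["id"]
```

Break that ordering, or drop the `tool_call_id` echo, and the next API call fails regardless of how good the model is.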
Loop = ReAct Control
Each iteration calls the model, executes requested tools, injects results, then calls again until no more tools are requested.
Tools = External Action
Without tools, the model can only infer. With tools, it can gather data, compute values, and persist structured outputs.
Assumptions
- Model emits valid tool-call JSON.
- Target pages are accessible and parseable.
- Context window can hold fetched evidence.
Failure Modes
- Tool-call mismatch triggers API errors.
- JS-heavy pages return thin content.
- Long runs cause context drift and goal loss.
Loop Skeleton (~30 Lines)
```python
import json

# Assumes client, MAX_ROUNDS, TOOLS, messages, and execute_tool are defined.
for _ in range(MAX_ROUNDS):
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        tools=TOOLS,
    )
    msg = response.choices[0].message
    if not msg.tool_calls:
        return msg.content          # no tool requested: the run is done
    messages.append(msg)            # assistant turn must precede its tool replies
    for tc in msg.tool_calls:
        result = execute_tool(
            tc.function.name,
            json.loads(tc.function.arguments),
        )
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,  # must echo the id from the assistant turn
            "content": result,
        })
```
This loop is the system. If message sequencing breaks, the agent breaks, even if the model is strong.
Tooling Decisions
web_search (Tavily)
Retrieves ranked candidates quickly so the model can choose which sources deserve full-page fetch.
fetch_page (async)
Uses parallel fetch + HTML cleanup (nav, script, style removal) to maximize useful context density.
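The same idea can be sketched with only the standard library: blocking fetches pushed onto worker threads and gathered, plus an HTMLParser-based cleanup that drops low-signal containers. The real tools.py may use different HTTP and parsing libraries; this is an assumption-laden minimal version.

```python
import asyncio
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping low-signal containers (nav, script, style)."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

def fetch_one(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return clean_html(resp.read().decode("utf-8", errors="replace"))

async def fetch_pages(urls: list[str]) -> list[str]:
    # urllib is blocking, so run each fetch in a worker thread and gather.
    return await asyncio.gather(*(asyncio.to_thread(fetch_one, u) for u in urls))
```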
calculate (AST safe eval)
Permits arithmetic while blocking unsafe execution vectors that a naive eval() approach would expose.
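A minimal sketch of the AST approach, assuming a whitelist of operators and numeric literals only; anything else (attribute access, calls, names) is rejected before evaluation:

```python
import ast
import operator

# Whitelisted operators; everything else in the AST is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
    ast.USub: operator.neg,
}

def safe_calculate(expr: str) -> float:
    """Evaluate arithmetic by walking the AST instead of calling eval().

    Only numeric constants and whitelisted operators pass, which blocks the
    injection vectors of a bare eval() (e.g. __import__, attribute chains).
    """
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("disallowed expression")

    return walk(ast.parse(expr, mode="eval"))
```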
write_report
Forces structured deliverables: title, summary, key points, and source list, instead of loose response text.
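A sketch of what such a writer can look like; the section names follow the fields above, while the filename pattern and function signature are assumptions, not the repository's exact code:

```python
from datetime import datetime
from pathlib import Path

def write_report(title: str, summary: str, key_points: list[str],
                 sources: list[str], out_dir: str = "reports") -> Path:
    """Render a structured markdown report and persist a timestamped artifact."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    body = "\n".join([
        f"# {title}",
        "",
        "## Summary",
        summary,
        "",
        "## Key Points",
        *[f"- {p}" for p in key_points],
        "",
        "## Sources",
        *[f"- {s}" for s in sources],
    ])
    path = Path(out_dir) / f"report-{stamp}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path
```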
Streaming
Makes reasoning visible in real time; critical for operator trust and live debugging in terminal workflows.
Schema Prompting
Tool descriptions act as behavior contracts. Better schema language improves tool selection consistency.
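As an illustration of a schema acting as a behavior contract, here is what an OpenAI-style function schema for calculate might look like (the wording is illustrative, not the repository's actual schema):

```python
CALCULATE_TOOL = {
    "type": "function",
    "function": {
        "name": "calculate",
        # The description doubles as a contract: when to pick this tool
        # and what it must NOT be used for.
        "description": (
            "Evaluate a pure arithmetic expression (numbers and + - * / ** % only). "
            "Use this for any numeric derivation instead of doing mental math. "
            "Do not pass variables, function calls, or natural language."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Arithmetic expression, e.g. '(17.4 - 3) / 2'",
                },
            },
            "required": ["expression"],
        },
    },
}
```

Tightening the "do not" clauses in descriptions like this is often the cheapest way to fix inconsistent tool selection.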
Challenges, Pivots, and Limits
Blocked / JS-rendered Pages
Raw HTTP fetch cannot execute JavaScript, so some pages degrade into low-signal context. Phase 2 targets this with stronger retrieval strategy.
Context Window Pressure
Large fetched payloads reduce attention on early turns, especially original goals. This causes drift in longer research runs.
No Native Verification
The model can cite sources, but citation is provenance, not truth validation. Cross-source verification is a future upgrade.
Tool Fragility
API ordering rules are strict. Missing assistant/tool pairing or mismatched tool_call_id results in immediate hard failure.
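A small invariant check can surface mis-sequenced state before the API rejects it. A sketch, assuming plain-dict messages (SDK message objects would need attribute access instead):

```python
def check_tool_pairing(messages: list[dict]) -> list[str]:
    """Return a list of sequencing violations in the message state.

    Invariant: every 'tool' message must follow an assistant turn and echo
    one of that turn's tool_call_ids.
    """
    problems = []
    pending_ids: set[str] = set()
    for i, m in enumerate(messages):
        role = m.get("role")
        if role == "assistant":
            pending_ids = {tc["id"] for tc in m.get("tool_calls") or []}
        elif role == "tool":
            if m.get("tool_call_id") not in pending_ids:
                problems.append(f"message {i}: unmatched tool_call_id {m.get('tool_call_id')!r}")
        else:
            pending_ids = set()
    return problems
```

Running a check like this on every loop iteration turns a cryptic 400 response into a pointed local error.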
Single-Pass Research
Phase 1 does one search + one fetch wave. Poor initial retrieval quality directly lowers final report quality.
No Caching
Repeat questions rerun the full pipeline. Latency and token cost remain high for repeated investigative themes.
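A future fix could be as simple as an on-disk cache keyed by a normalized query hash. A sketch only (a production version would need TTL expiry, since search results go stale):

```python
import hashlib
import json
from pathlib import Path

class SearchCache:
    """Tiny on-disk cache keyed by a normalized query hash."""

    def __init__(self, cache_dir: str = ".cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, query: str) -> Path:
        # Normalize so trivially different phrasings share an entry.
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()[:16]
        return self.dir / f"{key}.json"

    def get(self, query: str):
        p = self._path(query)
        return json.loads(p.read_text()) if p.exists() else None

    def put(self, query: str, results) -> None:
        self._path(query).write_text(json.dumps(results))
```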
Evaluation Snapshot
| Metric | Result | Interpretation |
|---|---|---|
| Tool Execution Reliability | High | Loop+dispatcher architecture is stable in normal runs. |
| Report Quality | Moderate | Quality rises/falls with source accessibility and content relevance. |
| Latency | ~40–70 seconds | Dominated by model round trips, not local compute. |
| Primary Failure Modes | Blocked pages, context drift | Both map directly to retrieval and memory limits. |
Key Learnings
Agents are loops, not sentience
Remove the loop and the "agent" collapses into a single completion call.
Schema quality drives behavior
Precise tool descriptions work as practical prompt control.
Context is the bottleneck
Long payloads can bury initial task intent unless summarized or retrieved selectively.
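One selective-retention tactic is to clip older tool payloads to a character budget so the system prompt and original question stay in view. A sketch under the same plain-dict message assumption:

```python
def trim_tool_payloads(messages: list[dict], keep_last: int = 2,
                       budget: int = 500) -> list[dict]:
    """Truncate the content of all but the most recent tool messages.

    Older fetched payloads rarely need to stay verbatim; clipping them keeps
    early turns (system prompt, original goal) inside the model's attention.
    """
    tool_indexes = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    to_trim = set(tool_indexes[:-keep_last]) if keep_last else set(tool_indexes)
    out = []
    for i, m in enumerate(messages):
        if i in to_trim and len(m.get("content", "")) > budget:
            m = {**m, "content": m["content"][:budget] + "\n[truncated]"}
        out.append(m)
    return out
```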
External data is noisy
Grounding requires source quality management, not just source collection.
Safety must be explicit
AST constraints and strict dispatcher logic are required guardrails, not optional polish.
Debugging starts with state
Inspecting message sequence explains most incorrect tool behavior faster than output-only inspection.
Phase 1 to Phase 2
Phase 1 Limits
- No persistent memory.
- No retrieval grounding layer.
- No robust source verification loop.
Phase 2 Direction
- Knowledge base + retrieval (RAG).
- Stronger source grounding and fallback behavior.
- Better context discipline under long evidence chains.
Appendix: Run Locally
```shell
git clone https://github.com/anudeepreddy332/cli-research-agent.git
cd cli-research-agent
uv sync
uv run python main.py
```
Code references: main.py, src/research_agent/tools.py, src/research_agent/config.py
Repository: CLI Research Agent on GitHub