TL;DR
This is the phase where an agent stops answering questions and starts pursuing a goal. The code-fix agent executes a script, reads the failure, diagnoses the root cause, patches the code, and tries again — until the script runs correctly, the cost ceiling is hit, or it runs out of attempts. The loop is expressed as an explicit LangGraph state machine, not a raw while loop.
The core concept: the agent is no longer answering a question. It is pursuing a goal — and observing the consequences of its own actions to adjust. That's the distinction between a query-response loop and a goal-directed loop.
Code execution is the sharpest use case for this because failure is unambiguous. Either the script runs or it doesn't. Either the output is correct or it isn't. No subjectivity. The exit code is ground truth.
What Phase 3 Adds
Phases 1 & 2
- Raw while loop — control flow implicit in conditionals.
- Query-driven: user asks, agent finds, agent responds.
- One turn per run — loop starts and ends in a single cycle.
- No human checkpoint mid-run.
Phase 3
- LangGraph state machine — every state named, every transition explicit.
- Goal-directed: agent keeps going until goal is met or resources exhausted.
- Reflexion loop — agent reads its own failure history to try a different approach.
- Human-in-the-loop checkpoint with interrupt/resume.
Architecture & State Machine
The agent is a directed graph. Each node is a function. Each edge is a transition — some unconditional, some conditional on the current state. Think of it like a quality control station at a factory: the broken script comes in, gets examined, repaired, approved, tested, and scored. If it passes, it ships. If it fails, it goes back to the repair station with the inspector's notes.
The Files
graph.py
Contains only two things: the AgentState TypedDict that declares every field the agent tracks, and build_graph(), which wires nodes and edges into a compiled LangGraph object. Nothing about how nodes work lives here — only how they connect. (A minimal sketch of both pieces follows this list.)
nodes.py
All the logic. Six node functions and three edge functions. Every LLM call, every tool dispatch, every conditional routing decision lives here.
tools.py
One function: execute_python. Writes code to a temp file, runs it in a subprocess with a hard timeout, captures stdout and stderr, and returns a structured result dict. No LLM calls. No graph knowledge.
main.py
CLI entry point. Builds initial state, invokes the graph, handles the human approval interrupt loop, prints the final report, and saves a structured run log to disk.
Evaluation scripts
benchmark.py — systematic evaluation across all 20 broken scripts. analyze_failures.py — extracts failed runs from benchmark JSON. regression_check.py — compares results against a stored baseline. test_resilience.py — simulates the four failure modes.
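To make the split concrete, here is a minimal sketch of graph.py. The trimmed AgentState lists only fields named in this write-up (the real one declares 18), and the exact edge wiring is inferred from the walk-through below rather than copied from the repo:

```python
# graph.py (abridged sketch). Field names beyond those mentioned in this
# write-up, and the exact edge wiring, are assumptions.
from typing import Optional, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, StateGraph

from nodes import (
    node_diagnose, node_evaluate, node_execute,
    node_human_approval, node_patch, should_continue,
)

class AgentState(TypedDict, total=False):
    code: str                 # current version of the script under repair
    error: Optional[str]      # stderr (or synthetic timeout message) from the last run
    exit_code: Optional[int]
    attempt_history: list     # one entry per prior attempt: code, error, diagnosis, patch
    iterations: int
    total_cost_usd: float
    score: Optional[int]      # evaluator verdict, 1-10
    human_approval: Optional[str]
    bypass_hitl: bool

def build_graph():
    g = StateGraph(AgentState)
    g.add_node("execute", node_execute)
    g.add_node("diagnose", node_diagnose)
    g.add_node("patch", node_patch)
    g.add_node("human_approval", node_human_approval)
    g.add_node("evaluate", node_evaluate)

    g.set_entry_point("execute")
    # Fixed edges: diagnose -> patch -> human approval -> re-execute.
    g.add_edge("diagnose", "patch")
    g.add_edge("patch", "human_approval")
    g.add_edge("human_approval", "execute")
    # should_continue applies the cost/iteration gates, then routes failed runs
    # back to diagnose and clean runs on to the evaluator.
    g.add_conditional_edges("execute", should_continue,
                            {"diagnose": "diagnose", "evaluate": "evaluate", "end": END})
    # The evaluator ships scores of 7 or above and sends everything else back.
    g.add_conditional_edges("evaluate",
                            lambda s: "done" if (s.get("score") or 0) >= 7 else "retry",
                            {"done": END, "retry": "diagnose"})
    # MemorySaver checkpointing is what makes interrupt()/resume possible.
    return g.compile(checkpointer=MemorySaver())
```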
Pipeline Walk-Through: attribute_error.py
Here's what happens when you run a real broken script through the agent, end to end.
```python
class Dog:
    def __init__(self, name):
        self.name = name

d = Dog("Rex")
print(d.age)  # Dog has no attribute 'age'
```
AttributeError: 'Dog' object has no attribute 'age'
node_execute captures this error, node_diagnose identifies the root cause (Dog's __init__ never sets an age attribute), and node_patch produces the fix:
```python
class Dog:
    def __init__(self, name):
        self.name = name
        self.age = None

d = Dog("Rex")
print(d.age)
```
The patch then goes to node_human_approval. In benchmark mode (bypass_hitl=True), it auto-approves. In interactive mode, the graph suspends via interrupt(): main.py displays a unified diff and prompts for y/n, and on resume human_approval is set to "approved". The patched script is re-executed and scored by the evaluator; per the benchmark table below, attribute_error.py finishes in 2 iterations with a 9/10 score.
Technical Deep Dive
LangGraph vs. Raw While Loop
A raw while loop is like a chef who keeps the recipe in their head — they know what step comes next because they remember where they are. A LangGraph state machine is like a recipe card with checkboxes: each step is written down, you can see what's done, and anyone can pick up mid-recipe without losing context.
The practical difference: adding human approval to a raw loop means adding more variables, more conditionals, and more mental tracing. In LangGraph, it meant adding two fields to AgentState and one new node. The rest of the graph didn't change.
Typed State and Observability
AgentState is a TypedDict with 18 fields. LangGraph validates node updates against this schema: if a node returns a key that isn't declared, the graph raises as soon as the node returns, not later when that key happens to be accessed. This turns silent bugs into loud, early errors.
The Reflexion Pattern
On the first failed attempt, node_diagnose receives the current code and error. On subsequent attempts, it receives the full history of prior attempts — each containing the code tried, the error it produced, the diagnosis made, and the patch applied. The prompt changes from "diagnose this error" to "here is everything that was tried and why it failed — what is a different approach?"
Archival timing is critical. attempt_history is archived by node_execute before it overwrites the error fields with new results. If archival happened after the new execution results were written, the wrong error would be saved. This ordering is enforced by the graph structure, not a comment in the code.
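A sketch of that ordering, plus the prompt switch in node_diagnose. Helper and field names beyond the ones discussed here are hypothetical, and llm stands in for the DeepSeek chat client the project configures:

```python
# nodes.py (sketch): archive-before-overwrite ordering inside node_execute.
from graph import AgentState          # the TypedDict sketched earlier
from tools import execute_python

def node_execute(state: AgentState) -> dict:
    update: dict = {}

    # 1) Archive the attempt that just failed while its error fields are still intact.
    if state.get("error"):
        update["attempt_history"] = state.get("attempt_history", []) + [{
            "code": state["code"],
            "error": state["error"],
            "diagnosis": state.get("diagnosis"),
            "patch_summary": state.get("patch_summary"),
        }]

    # 2) Only now run the current (possibly patched) code and overwrite those fields.
    result = execute_python(state["code"])
    update["error"] = None if result["exit_code"] == 0 else result["stderr"]
    update["exit_code"] = result["exit_code"]
    update["iterations"] = state.get("iterations", 0) + 1
    return update

def node_diagnose(state: AgentState) -> dict:
    # Reflexion: the first attempt sees only the current failure; later attempts
    # see the full history and are asked for a different approach.
    if state.get("attempt_history"):
        prompt = ("Here is everything that was tried and why it failed:\n"
                  f"{state['attempt_history']}\n\nCurrent error:\n{state['error']}\n"
                  "Propose a different approach.")
    else:
        prompt = f"Diagnose this error:\n{state['error']}\n\nCode:\n{state['code']}"
    diagnosis = llm.invoke(prompt).content   # `llm` is a stand-in for the real client
    return {"diagnosis": diagnosis}
```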
Human-in-the-Loop with interrupt/resume
When node_human_approval calls interrupt(payload), the graph stops. The entire state is checkpointed by MemorySaver. Control returns to main.py, which displays a unified diff and prompts the user. When the user responds, main.py calls graph.invoke(Command(resume=decision), config) with the same thread ID — and the graph resumes from the exact point where it suspended.
This matters architecturally: in a production system, the resume signal wouldn't come from the terminal — it would come from a webhook, a UI button, or a message queue. The interrupt()/resume pattern supports all of these without changing the graph.
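A minimal sketch of both halves of the handshake. Names and payloads are illustrative, and detecting the pause via get_state(config).next is one option; the exact surface varies across LangGraph versions:

```python
# nodes.py side (sketch): suspend the graph until a decision arrives.
from langgraph.types import Command, interrupt
from graph import AgentState, build_graph

def node_human_approval(state: AgentState) -> dict:
    if state.get("bypass_hitl"):
        return {"human_approval": "approved"}       # benchmark mode: auto-approve
    # interrupt() checkpoints the full state (via MemorySaver) and returns control
    # to the caller; whatever the caller resumes with becomes `decision`.
    decision = interrupt({"proposed_code": state.get("code")})
    return {"human_approval": decision}

# main.py side (sketch): drive the handshake from the CLI.
graph = build_graph()
config = {"configurable": {"thread_id": "run-1"}}   # same thread ID on every call
initial_state = {"code": open("scripts/broken_scripts/attribute_error.py").read(),
                 "iterations": 0, "total_cost_usd": 0.0}
graph.invoke(initial_state, config)

while graph.get_state(config).next:                 # non-empty => graph is paused
    answer = input("Apply this patch? [y/n] ")
    decision = "approved" if answer.lower().startswith("y") else "rejected"
    # Command(resume=...) re-enters node_human_approval exactly where it suspended.
    graph.invoke(Command(resume=decision), config)
```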
Sandboxed Code Execution
execute_python writes code to a temp file and runs it with subprocess.run(). This is intentional isolation. If you used Python's exec() built-in, the code would run inside the agent's own process — a script that calls sys.exit() or enters an infinite loop would crash or hang the agent itself. With subprocess isolation, a crash is just a non-zero exit code and a stderr string.
The temp file is deleted in a finally block — even if the subprocess crashes or times out, no orphaned temp files accumulate across runs.
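A sketch of the tool under those constraints; the signature and the result-dict keys are assumptions, while the 10-second timeout and the finally-block cleanup are as described:

```python
# tools.py (sketch): run untrusted code in a subprocess, never via exec().
import os
import subprocess
import sys
import tempfile

def execute_python(code: str, timeout_s: int = 10) -> dict:
    if not code.strip():                   # input validation: refuse empty code
        return {"exit_code": 1, "stdout": "", "stderr": "empty code", "timed_out": False}

    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(code)
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"exit_code": proc.returncode, "stdout": proc.stdout,
                "stderr": proc.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        # Synthetic, human-readable result the diagnose node can reason about.
        return {"exit_code": -1, "stdout": "",
                "stderr": f"Execution timed out after {timeout_s}s (possible infinite loop).",
                "timed_out": True}
    finally:
        os.unlink(path)                    # no orphaned temp files, even on crash or timeout
```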
LLM-Based Evaluator
The evaluator is what separates "the code runs" from "the code is correct." Exit code 0 means no exception was raised. It doesn't mean the output is right. logic_error.py proved this: the inverted Fahrenheit-to-Celsius formula ran cleanly and exited 0, but produced 180.0 for 212°F instead of the correct 100.0.
When the evaluator scores below 7, it doesn't just fail the run — it sends the agent back to node_diagnose with the evaluator's specific feedback. The agent now knows not just that its fix was wrong, but precisely why.
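A sketch of what the scoring step could look like. The prompt wording, field names, and llm client are placeholders; the below-7 threshold and the score-3 fallback on bad JSON come from the behavior described in this post:

```python
import json
from graph import AgentState

EVAL_PROMPT = """You are reviewing a bug fix.
Original (broken) code:
{original}

Patched code:
{patched}

Program output:
{stdout}

Does the patch preserve the original intent and produce correct output?
Reply with JSON: {{"score": <1-10>, "feedback": "<one paragraph>"}}"""

def node_evaluate(state: AgentState) -> dict:
    prompt = EVAL_PROMPT.format(
        original=state["original_code"],    # field names are illustrative
        patched=state["code"],
        stdout=state.get("stdout", ""),
    )
    raw = llm.invoke(prompt).content        # `llm` is a stand-in for the real client
    try:
        verdict = json.loads(raw)
        score, feedback = int(verdict["score"]), verdict.get("feedback", "")
    except (json.JSONDecodeError, KeyError, ValueError):
        score, feedback = 3, "Evaluator response was not valid JSON."  # conservative default
    # Scores below 7 route back to node_diagnose with this feedback attached.
    return {"score": score, "evaluator_feedback": feedback}
```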
Cost Tracking and Guards
Cost Ceiling ($0.10/run)
Checked first in should_continue before anything else. Money gates go first — even a successful fix on the ceiling-crossing iteration terminates as cost_exceeded. (A sketch of should_continue follows this list of guards.)
Max Iterations (5)
Checked after the cost gate. Prevents runaway cost on cases where the agent consistently makes the wrong diagnosis or the problem is structurally unfixable.
Execution Timeout (10s)
Subprocess is killed after 10 seconds via SIGKILL. The agent receives a synthetic timed_out=True result with a readable error string — it correctly diagnosed infinite_loop.py from this synthetic message.
Input Validation
Non-empty code check before writing to disk. JSON parse failure in node_evaluate defaults to score=3. Missing code fences in node_patch fall back to the raw response.
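Both gates live in the should_continue edge function. A minimal sketch of that ordering, using the route labels from the graph sketch above; the exit-code branch is an inference from the walk-through, not copied from the repo:

```python
from graph import AgentState

COST_CEILING_USD = 0.10
MAX_ITERATIONS = 5

def should_continue(state: AgentState) -> str:
    # Money gate first: even a successful fix on the ceiling-crossing
    # iteration terminates the run as cost_exceeded.
    if state.get("total_cost_usd", 0.0) >= COST_CEILING_USD:
        return "end"      # status: cost_exceeded
    # Iteration gate second: prevents runaway on structurally unfixable scripts.
    if state.get("iterations", 0) >= MAX_ITERATIONS:
        return "end"      # status: blocked
    # Otherwise route on the last execution result.
    if state.get("exit_code") == 0:
        return "evaluate"
    return "diagnose"
```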
Benchmark Results — 20 Scripts
The final run across all 20 broken scripts. bypass_hitl=True auto-approved every patch, making the run non-interactive and reproducible.
| Script | Status | Iter | Score | Cost | Time |
|---|---|---|---|---|---|
| attribute_error.py | done | 2 | 9/10 | $0.0004 | 4.4s |
| dict_mutation.py | done | 2 | 10/10 | $0.0003 | 4.2s |
| encoding_error.py | done | 2 | 10/10 | $0.0004 | 4.6s |
| file_not_found.py | done | 2 | 10/10 | $0.0004 | 3.6s |
| import_error.py | not fixed | 5 | 2/10 | $0.0029 | 24.3s |
| index_error.py | done | 2 | 10/10 | $0.0003 | 3.4s |
| infinite_loop.py | done | 2 | 10/10 | $0.0003 | 14.3s |
| key_error.py | done | 2 | 9/10 | $0.0003 | 4.2s |
| logic_error.py | done | 2 | 10/10 | $0.0007 | 7.2s |
| none_type_error.py | done | 2 | 10/10 | $0.0003 | 3.9s |
| off_by_one.py | done | 2 | 10/10 | $0.0004 | 4.7s |
| recursion_error.py | done | 3 | 9/10 | $0.0012 | 10.0s |
| string_format_error.py | done | 2 | 10/10 | $0.0004 | 4.6s |
| syntax_error.py | done | 2 | 10/10 | $0.0003 | 3.9s |
| tricky_error.py | done | 2 | 9/10 | $0.0005 | 14.6s |
| type_error.py | done | 2 | 10/10 | $0.0004 | 6.3s |
| unbound_local.py | done | 2 | 10/10 | $0.0004 | 4.8s |
| value_error.py | done | 2 | 10/10 | $0.0003 | 3.8s |
| wrong_return.py | done | 4 | 10/10 | $0.0020 | 19.5s |
| zero_division.py | done | 2 | 9/10 | $0.0004 | 4.9s |
The one failure is a sandbox constraint, not an agent failure. import_error.py tries to import numpy and pandas, which aren't installed in the agent's environment. The agent correctly identified the missing modules — but every attempted fix either renamed the imports, swapped to other uninstalled libraries, or rewrote the functionality in ways the evaluator scored as losing the original intent. The fix rate of 95% is a precise claim: this agent, these 20 scripts, this environment, these parameters.
Resilience Testing
Execution Timeout
Agent is given while True: pass. It times out after 10 seconds, receives the synthetic timeout error string, diagnoses an infinite loop, and either fixes it or blocks at MAX_ITERATIONS. Either outcome is acceptable; a crash is not. All four resilience tests pass.
Cost Ceiling
Agent is given normal code but with total_cost_usd pre-set to 999.0. The first call to should_continue sees cost above ceiling and routes to "end" — before any LLM call is made. Status: cost_exceeded.
Max Iterations
Agent is given broken syntax with iterations pre-set to MAX_ITERATIONS. The first should_continue check routes to "end". Status: "blocked". No runaway.
Invalid API Key
DEEPSEEK_API_KEY is temporarily replaced with garbage. The test asserts the final status is not "done" — either the agent raises an exception or returns "blocked". The original key is restored in a finally block.
Guardrails & Known Limits
Known Limitations
No pip access in sandbox
Any script requiring an uninstalled library cannot be fixed by installing it. The agent will exhaust iterations trying to rewrite without the library — often the wrong approach.
Evaluator relies on intent inference
Without a ground truth expected_output, the evaluator infers intent from the original code. Usually correct, not guaranteed. Supplying expected output would improve accuracy for logic errors.
No rejection cap
A human can reject every patch indefinitely. A production system should add a rejection_count field to state and terminate with "blocked" after N rejections.
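One way that fix could look, reusing the approval node from the earlier sketch; rejection_count and the cap of three are proposals, not fields that exist in the repo:

```python
from langgraph.types import interrupt
from graph import AgentState

MAX_REJECTIONS = 3   # illustrative value

def node_human_approval(state: AgentState) -> dict:
    if state.get("bypass_hitl"):
        return {"human_approval": "approved"}
    decision = interrupt({"proposed_code": state.get("code")})
    rejections = state.get("rejection_count", 0) + (0 if decision == "approved" else 1)
    return {"human_approval": decision, "rejection_count": rejections}

# ...and one extra gate in should_continue, alongside the cost and iteration checks:
#     if state.get("rejection_count", 0) >= MAX_REJECTIONS:
#         return "end"   # status: blocked
```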
The Typo Bug
A typo crept into the bypass_hitl field name — written as bypass_hit1 (digit one) in one location while referenced as bypass_hitl (lowercase L) elsewhere. In benchmark mode, the auto-approve path was never triggered — the graph reached interrupt() and waited for human input that was never coming. The bug was caught when the benchmark hung without output. Typos in dict keys are silent killers. LangGraph validates node return values, not key access within node functions. The only defenses are careful review, consistent naming conventions, and integration tests that exercise the bypass path.
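A sketch of such a test, written against the hypothetical build_graph sketch earlier in this post. It would have turned the silent hang into a failing assertion, at the cost of a few real LLM calls per test run:

```python
# test_bypass_path.py (hypothetical): the benchmark path must never pause for a human.
from graph import build_graph

def test_bypass_hitl_never_interrupts():
    graph = build_graph()
    config = {"configurable": {"thread_id": "test-bypass"}}
    state = {
        "code": "print(1 / 0)",     # trivially broken script
        "bypass_hitl": True,        # the exact key the nodes are supposed to read
        "iterations": 0,
        "total_cost_usd": 0.0,
    }
    graph.invoke(state, config)
    # If a typo (bypass_hit1) breaks auto-approve, the graph stalls at interrupt()
    # and the pending-node list is non-empty.
    assert not graph.get_state(config).next
```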
Key Learnings
Explicit state is not a luxury
Externalizing agent state into a typed schema made debugging fast and extension cheap. Adding human approval required two new fields and one new node — nothing else changed.
exit_code 0 ≠ correctness
The evaluator is the most important addition. Any agent that treats execution success as correctness will produce confidently wrong results. One more LLM call per success — it's the call that makes the system trustworthy.
Reflexion requires designed memory
Getting the archival timing right — saving the previous attempt before overwriting current error fields — required explicit reasoning about node order. The graph structure enforces it.
Instrumentation is not optional
LangSmith tracing immediately showed that node_diagnose and node_evaluate have the highest latency (2–4s each). Without tracing, this would have required manual timing instrumentation.
Sandbox design shapes what you can fix
The 5% failure rate is entirely explained by the sandbox design choice: no pip access. Understanding where your agent fails — and whether those are agent bugs or environmental constraints — requires structured logging.
Typos in dict keys are silent
No error. No stack trace. Just the wrong branch taken silently. The only defenses are code review, consistent conventions, and integration tests that exercise every path.
Appendix: Run Locally
```bash
git clone https://github.com/anudeepreddy332/code-agent.git
cd code-agent
uv sync
uv run python main.py scripts/broken_scripts/attribute_error.py
```
Core files: graph.py, nodes.py, tools.py, main.py
Repository: https://github.com/anudeepreddy332/code-agent