TL;DR
This is the phase where an agent stops answering questions and starts pursuing a goal. The code-fix agent executes a script, reads the failure, diagnoses the root cause, patches the code, and tries again — until the script runs correctly, the cost ceiling is hit, or it runs out of attempts. The loop is expressed as an explicit LangGraph state machine, not a raw while loop.
The core concept: the agent is no longer answering a question. It is pursuing a goal — and observing the consequences of its own actions to adjust. That's the distinction between a query-response loop and a goal-directed loop.
Code execution is the sharpest use case for this because failure is unambiguous. Either the script runs or it doesn't. Either the output is correct or it isn't. No subjectivity. The exit code is ground truth.
What Phase 3 Adds
Phases 1 & 2
- Raw while loop — control flow implicit in conditionals.
- Query-driven: user asks, agent finds, agent responds.
- One turn per run — loop starts and ends in a single cycle.
- No human checkpoint mid-run.
Phase 3
- LangGraph state machine — every state named, every transition explicit.
- Goal-directed: agent keeps going until goal is met or resources exhausted.
- Reflexion loop — agent reads its own failure history to try a different approach.
- Human-in-the-loop checkpoint with interrupt/resume.
Architecture & State Machine
The agent is a directed graph. Each node is a function. Each edge is a transition — some unconditional, some conditional on the current state. Think of it like a quality control station at a factory: the broken script comes in, gets examined, repaired, approved, tested, and scored. If it passes, it ships. If it fails, it goes back to the repair station with the inspector's notes.
The Files
graph.py
Contains only two things: the AgentState TypedDict that declares every field the agent tracks, and build_graph(), which wires nodes and edges into a compiled LangGraph object. Nothing about how nodes work lives here — only how they connect. (A minimal sketch of both pieces follows this list.)
nodes.py
All the logic. Six node functions and three edge functions. Every LLM call, every tool dispatch, every conditional routing decision lives here.
tools.py
One function: execute_python. Writes code to a temp file, runs it in a subprocess with a hard timeout, captures stdout and stderr, and returns a structured result dict. No LLM calls. No graph knowledge.
main.py
CLI entry point. Builds initial state, invokes the graph, handles the human approval interrupt loop, prints the final report, and saves a structured run log to disk.
Evaluation scripts
benchmark.py — systematic evaluation across all 20 broken scripts. analyze_failures.py — extracts failed runs from benchmark JSON. regression_check.py — compares results against a stored baseline. test_resilience.py — simulates the four failure modes.
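To make the split concrete, here is a minimal sketch of graph.py. The trimmed AgentState lists only fields named in this write-up (the real one declares 18), and the exact edge wiring is inferred from the walk-through below rather than copied from the repo:

```python
# graph.py (abridged sketch). Field names beyond those mentioned in this
# write-up, and the exact edge wiring, are assumptions.
from typing import Optional, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, StateGraph

from nodes import (
    node_diagnose, node_evaluate, node_execute,
    node_human_approval, node_patch, should_continue,
)

class AgentState(TypedDict, total=False):
    code: str                 # current version of the script under repair
    error: Optional[str]      # stderr (or synthetic timeout message) from the last run
    exit_code: Optional[int]
    attempt_history: list     # one entry per prior attempt: code, error, diagnosis, patch
    iterations: int
    total_cost_usd: float
    score: Optional[int]      # evaluator verdict, 1-10
    human_approval: Optional[str]
    bypass_hitl: bool

def build_graph():
    g = StateGraph(AgentState)
    g.add_node("execute", node_execute)
    g.add_node("diagnose", node_diagnose)
    g.add_node("patch", node_patch)
    g.add_node("human_approval", node_human_approval)
    g.add_node("evaluate", node_evaluate)

    g.set_entry_point("execute")
    # Fixed edges: diagnose -> patch -> human approval -> re-execute.
    g.add_edge("diagnose", "patch")
    g.add_edge("patch", "human_approval")
    g.add_edge("human_approval", "execute")
    # should_continue applies the cost/iteration gates, then routes failed runs
    # back to diagnose and clean runs on to the evaluator.
    g.add_conditional_edges("execute", should_continue,
                            {"diagnose": "diagnose", "evaluate": "evaluate", "end": END})
    # The evaluator ships scores of 7 or above and sends everything else back.
    g.add_conditional_edges("evaluate",
                            lambda s: "done" if (s.get("score") or 0) >= 7 else "retry",
                            {"done": END, "retry": "diagnose"})
    # MemorySaver checkpointing is what makes interrupt()/resume possible.
    return g.compile(checkpointer=MemorySaver())
```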
Pipeline Walk-Through: attribute_error.py
Here's what happens when you run a real broken script through the agent, end to end.
```python
class Dog:
    def __init__(self, name):
        self.name = name

d = Dog("Rex")
print(d.age)  # Dog has no attribute 'age'
```
AttributeError: 'Dog' object has no attribute 'age'
node_execute captures this error, node_diagnose identifies the root cause (Dog's __init__ never sets an age attribute), and node_patch produces the fix:
```python
class Dog:
    def __init__(self, name):
        self.name = name
        self.age = None

d = Dog("Rex")
print(d.age)
```
The patch then goes to node_human_approval. In benchmark mode (bypass_hitl=True), it auto-approves. In interactive mode, the graph suspends via interrupt(): main.py displays a unified diff and prompts for y/n, and on resume human_approval is set to "approved". The patched script is re-executed and scored by the evaluator; per the benchmark table below, attribute_error.py finishes in 2 iterations with a 9/10 score.
Technical Deep Dive
LangGraph vs. Raw While Loop
A raw while loop is like a chef who keeps the recipe in their head — they know what step comes next because they remember where they are. A LangGraph state machine is like a recipe card with checkboxes: each step is written down, you can see what's done, and anyone can pick up mid-recipe without losing context.
The practical difference: adding human approval to a raw loop means adding more variables, more conditionals, and more mental tracing. In LangGraph, it meant adding two fields to AgentState and one new node. The rest of the graph didn't change.
Typed State and Observability
AgentState is a TypedDict with 18 fields. LangGraph validates node updates against this schema: if a node returns a key that isn't declared, the graph raises as soon as the node returns, not later when that key happens to be accessed. This turns silent bugs into loud, early errors.
The Reflexion Pattern
On the first failed attempt, node_diagnose receives the current code and error. On subsequent attempts, it receives the full history of prior attempts — each containing the code tried, the error it produced, the diagnosis made, and the patch applied. The prompt changes from "diagnose this error" to "here is everything that was tried and why it failed — what is a different approach?"
Archival timing is critical. attempt_history is archived by node_execute before it overwrites the error fields with new results. If archival happened after the new execution results were written, the wrong error would be saved. This ordering is enforced by the graph structure, not a comment in the code.
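A sketch of that ordering, plus the prompt switch in node_diagnose. Helper and field names beyond the ones discussed here are hypothetical, and llm stands in for the DeepSeek chat client the project configures:

```python
# nodes.py (sketch): archive-before-overwrite ordering inside node_execute.
from graph import AgentState          # the TypedDict sketched earlier
from tools import execute_python

def node_execute(state: AgentState) -> dict:
    update: dict = {}

    # 1) Archive the attempt that just failed while its error fields are still intact.
    if state.get("error"):
        update["attempt_history"] = state.get("attempt_history", []) + [{
            "code": state["code"],
            "error": state["error"],
            "diagnosis": state.get("diagnosis"),
            "patch_summary": state.get("patch_summary"),
        }]

    # 2) Only now run the current (possibly patched) code and overwrite those fields.
    result = execute_python(state["code"])
    update["error"] = None if result["exit_code"] == 0 else result["stderr"]
    update["exit_code"] = result["exit_code"]
    update["iterations"] = state.get("iterations", 0) + 1
    return update

def node_diagnose(state: AgentState) -> dict:
    # Reflexion: the first attempt sees only the current failure; later attempts
    # see the full history and are asked for a different approach.
    if state.get("attempt_history"):
        prompt = ("Here is everything that was tried and why it failed:\n"
                  f"{state['attempt_history']}\n\nCurrent error:\n{state['error']}\n"
                  "Propose a different approach.")
    else:
        prompt = f"Diagnose this error:\n{state['error']}\n\nCode:\n{state['code']}"
    diagnosis = llm.invoke(prompt).content   # `llm` is a stand-in for the real client
    return {"diagnosis": diagnosis}
```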
Human-in-the-Loop with interrupt/resume
When node_human_approval calls interrupt(payload), the graph stops. The entire state is checkpointed by MemorySaver. Control returns to main.py, which displays a unified diff and prompts the user. When the user responds, main.py calls graph.invoke(Command(resume=decision), config) with the same thread ID — and the graph resumes from the exact point where it suspended.
This matters architecturally: in a production system, the resume signal wouldn't come from the terminal — it would come from a webhook, a UI button, or a message queue. The interrupt()/resume pattern supports all of these without changing the graph.
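A minimal sketch of both halves of the handshake. Names and payloads are illustrative, and detecting the pause via get_state(config).next is one option; the exact surface varies across LangGraph versions:

```python
# nodes.py side (sketch): suspend the graph until a decision arrives.
from langgraph.types import Command, interrupt
from graph import AgentState, build_graph

def node_human_approval(state: AgentState) -> dict:
    if state.get("bypass_hitl"):
        return {"human_approval": "approved"}       # benchmark mode: auto-approve
    # interrupt() checkpoints the full state (via MemorySaver) and returns control
    # to the caller; whatever the caller resumes with becomes `decision`.
    decision = interrupt({"proposed_code": state.get("code")})
    return {"human_approval": decision}

# main.py side (sketch): drive the handshake from the CLI.
graph = build_graph()
config = {"configurable": {"thread_id": "run-1"}}   # same thread ID on every call
initial_state = {"code": open("scripts/broken_scripts/attribute_error.py").read(),
                 "iterations": 0, "total_cost_usd": 0.0}
graph.invoke(initial_state, config)

while graph.get_state(config).next:                 # non-empty => graph is paused
    answer = input("Apply this patch? [y/n] ")
    decision = "approved" if answer.lower().startswith("y") else "rejected"
    # Command(resume=...) re-enters node_human_approval exactly where it suspended.
    graph.invoke(Command(resume=decision), config)
```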
Sandboxed Code Execution
execute_python writes code to a temp file and runs it with subprocess.run(). This is intentional isolation. If you used Python's exec() built-in, the code would run inside the agent's own process — a script that calls sys.exit() or enters an infinite loop would crash or hang the agent itself. With subprocess isolation, a crash is just a non-zero exit code and a stderr string.
The temp file is deleted in a finally block — even if the subprocess crashes or times out, no orphaned temp files accumulate across runs.
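A sketch of the tool under those constraints; the signature and the result-dict keys are assumptions, while the 10-second timeout and the finally-block cleanup are as described:

```python
# tools.py (sketch): run untrusted code in a subprocess, never via exec().
import os
import subprocess
import sys
import tempfile

def execute_python(code: str, timeout_s: int = 10) -> dict:
    if not code.strip():                   # input validation: refuse empty code
        return {"exit_code": 1, "stdout": "", "stderr": "empty code", "timed_out": False}

    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(code)
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"exit_code": proc.returncode, "stdout": proc.stdout,
                "stderr": proc.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        # Synthetic, human-readable result the diagnose node can reason about.
        return {"exit_code": -1, "stdout": "",
                "stderr": f"Execution timed out after {timeout_s}s (possible infinite loop).",
                "timed_out": True}
    finally:
        os.unlink(path)                    # no orphaned temp files, even on crash or timeout
```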
LLM-Based Evaluator
The evaluator is what separates "the code runs" from "the code is correct." Exit code 0 means no exception was raised. It doesn't mean the output is right. logic_error.py proved this: the inverted Fahrenheit-to-Celsius formula ran cleanly and exited 0, but produced 180.0 for 212°F instead of the correct 100.0.
When the evaluator scores below 7, it doesn't just fail the run — it sends the agent back to node_diagnose with the evaluator's specific feedback. The agent now knows not just that its fix was wrong, but precisely why.
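A sketch of what the scoring step could look like. The prompt wording, field names, and llm client are placeholders; the below-7 threshold and the score-3 fallback on bad JSON come from the behavior described in this post:

```python
import json
from graph import AgentState

EVAL_PROMPT = """You are reviewing a bug fix.
Original (broken) code:
{original}

Patched code:
{patched}

Program output:
{stdout}

Does the patch preserve the original intent and produce correct output?
Reply with JSON: {{"score": <1-10>, "feedback": "<one paragraph>"}}"""

def node_evaluate(state: AgentState) -> dict:
    prompt = EVAL_PROMPT.format(
        original=state["original_code"],    # field names are illustrative
        patched=state["code"],
        stdout=state.get("stdout", ""),
    )
    raw = llm.invoke(prompt).content        # `llm` is a stand-in for the real client
    try:
        verdict = json.loads(raw)
        score, feedback = int(verdict["score"]), verdict.get("feedback", "")
    except (json.JSONDecodeError, KeyError, ValueError):
        score, feedback = 3, "Evaluator response was not valid JSON."  # conservative default
    # Scores below 7 route back to node_diagnose with this feedback attached.
    return {"score": score, "evaluator_feedback": feedback}
```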
Cost Tracking and Guards
Cost Ceiling ($0.10/run)
Checked first in should_continue before anything else. Money gates go first — even a successful fix on the ceiling-crossing iteration terminates as cost_exceeded. (A sketch of should_continue follows this list of guards.)
Max Iterations (5)
Checked after the cost gate. Prevents runaway cost on cases where the agent consistently makes the wrong diagnosis or the problem is structurally unfixable.
Execution Timeout (10s)
Subprocess is killed after 10 seconds via SIGKILL. The agent receives a synthetic timed_out=True result with a readable error string — it correctly diagnosed infinite_loop.py from this synthetic message.
Input Validation
Non-empty code check before writing to disk. JSON parse failure in node_evaluate defaults to score=3. Missing code fences in node_patch fall back to the raw response.
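Both gates live in the should_continue edge function. A minimal sketch of that ordering, using the route labels from the graph sketch above; the exit-code branch is an inference from the walk-through, not copied from the repo:

```python
from graph import AgentState

COST_CEILING_USD = 0.10
MAX_ITERATIONS = 5

def should_continue(state: AgentState) -> str:
    # Money gate first: even a successful fix on the ceiling-crossing
    # iteration terminates the run as cost_exceeded.
    if state.get("total_cost_usd", 0.0) >= COST_CEILING_USD:
        return "end"      # status: cost_exceeded
    # Iteration gate second: prevents runaway on structurally unfixable scripts.
    if state.get("iterations", 0) >= MAX_ITERATIONS:
        return "end"      # status: blocked
    # Otherwise route on the last execution result.
    if state.get("exit_code") == 0:
        return "evaluate"
    return "diagnose"
```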
Benchmark Results — 20 Scripts
The final run across all 20 broken scripts. bypass_hitl=True auto-approved every patch, making the run non-interactive and reproducible.
| Script | Status | Iter | Score | Cost | Time |
|---|---|---|---|---|---|
| attribute_error.py | done | 2 | 9/10 | $0.0004 | 4.4s |
| dict_mutation.py | done | 2 | 10/10 | $0.0003 | 4.2s |
| encoding_error.py | done | 2 | 10/10 | $0.0004 | 4.6s |
| file_not_found.py | done | 2 | 10/10 | $0.0004 | 3.6s |
| import_error.py | not fixed | 5 | 2/10 | $0.0029 | 24.3s |
| index_error.py | done | 2 | 10/10 | $0.0003 | 3.4s |
| infinite_loop.py | done | 2 | 10/10 | $0.0003 | 14.3s |
| key_error.py | done | 2 | 9/10 | $0.0003 | 4.2s |
| logic_error.py | done | 2 | 10/10 | $0.0007 | 7.2s |
| none_type_error.py | done | 2 | 10/10 | $0.0003 | 3.9s |
| off_by_one.py | done | 2 | 10/10 | $0.0004 | 4.7s |
| recursion_error.py | done | 3 | 9/10 | $0.0012 | 10.0s |
| string_format_error.py | done | 2 | 10/10 | $0.0004 | 4.6s |
| syntax_error.py | done | 2 | 10/10 | $0.0003 | 3.9s |
| tricky_error.py | done | 2 | 9/10 | $0.0005 | 14.6s |
| type_error.py | done | 2 | 10/10 | $0.0004 | 6.3s |
| unbound_local.py | done | 2 | 10/10 | $0.0004 | 4.8s |
| value_error.py | done | 2 | 10/10 | $0.0003 | 3.8s |
| wrong_return.py | done | 4 | 10/10 | $0.0020 | 19.5s |
| zero_division.py | done | 2 | 9/10 | $0.0004 | 4.9s |
The one failure is a sandbox constraint, not an agent failure. import_error.py tries to import numpy and pandas, which aren't installed in the agent's environment. The agent correctly identified the missing modules — but every attempted fix either renamed the imports, swapped to other uninstalled libraries, or rewrote the functionality in ways the evaluator scored as losing the original intent. The fix rate of 95% is a precise claim: this agent, these 20 scripts, this environment, these parameters.
Resilience Testing
Execution Timeout
Agent is given while True: pass. It times out after 10 seconds, receives the synthetic timeout error string, diagnoses an infinite loop, and either fixes it or blocks at MAX_ITERATIONS. Either outcome is acceptable; a crash is not. All four resilience tests pass.
Cost Ceiling
Agent is given normal code but with total_cost_usd pre-set to 999.0. The first call to should_continue sees cost above ceiling and routes to "end" — before any LLM call is made. Status: cost_exceeded.
Max Iterations
Agent is given broken syntax with iterations pre-set to MAX_ITERATIONS. The first should_continue check routes to "end". Status: "blocked". No runaway.
Invalid API Key
DEEPSEEK_API_KEY is temporarily replaced with garbage. The test asserts the final status is not "done" — either the agent raises an exception or returns "blocked". The original key is restored in a finally block.
Guardrails & Known Limits
Known Limitations
No pip access in sandbox
Any script requiring an uninstalled library cannot be fixed by installing it. The agent will exhaust iterations trying to rewrite without the library — often the wrong approach.
Evaluator relies on intent inference
Without a ground truth expected_output, the evaluator infers intent from the original code. Usually correct, not guaranteed. Supplying expected output would improve accuracy for logic errors.
No rejection cap
A human can reject every patch indefinitely. A production system should add a rejection_count field to state and terminate with "blocked" after N rejections.
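One way that fix could look, reusing the approval node from the earlier sketch; rejection_count and the cap of three are proposals, not fields that exist in the repo:

```python
from langgraph.types import interrupt
from graph import AgentState

MAX_REJECTIONS = 3   # illustrative value

def node_human_approval(state: AgentState) -> dict:
    if state.get("bypass_hitl"):
        return {"human_approval": "approved"}
    decision = interrupt({"proposed_code": state.get("code")})
    rejections = state.get("rejection_count", 0) + (0 if decision == "approved" else 1)
    return {"human_approval": decision, "rejection_count": rejections}

# ...and one extra gate in should_continue, alongside the cost and iteration checks:
#     if state.get("rejection_count", 0) >= MAX_REJECTIONS:
#         return "end"   # status: blocked
```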
The Typo Bug
A typo crept into the bypass_hitl field name — written as bypass_hit1 (digit one) in one location while referenced as bypass_hitl (lowercase L) elsewhere. In benchmark mode, the auto-approve path was never triggered — the graph reached interrupt() and waited for human input that was never coming. The bug was caught when the benchmark hung without output. Typos in dict keys are silent killers. LangGraph validates node return values, not key access within node functions. The only defenses are careful review, consistent naming conventions, and integration tests that exercise the bypass path.
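A sketch of such a test, written against the hypothetical build_graph sketch earlier in this post. It would have turned the silent hang into a failing assertion, at the cost of a few real LLM calls per test run:

```python
# test_bypass_path.py (hypothetical): the benchmark path must never pause for a human.
from graph import build_graph

def test_bypass_hitl_never_interrupts():
    graph = build_graph()
    config = {"configurable": {"thread_id": "test-bypass"}}
    state = {
        "code": "print(1 / 0)",     # trivially broken script
        "bypass_hitl": True,        # the exact key the nodes are supposed to read
        "iterations": 0,
        "total_cost_usd": 0.0,
    }
    graph.invoke(state, config)
    # If a typo (bypass_hit1) breaks auto-approve, the graph stalls at interrupt()
    # and the pending-node list is non-empty.
    assert not graph.get_state(config).next
```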
Key Learnings
Explicit state is not a luxury
Externalizing agent state into a typed schema made debugging fast and extension cheap. Adding human approval required two new fields and one new node — nothing else changed.
exit_code 0 ≠ correctness
The evaluator is the most important addition. Any agent that treats execution success as correctness will produce confidently wrong results. One more LLM call per success — it's the call that makes the system trustworthy.
Reflexion requires designed memory
Getting the archival timing right — saving the previous attempt before overwriting current error fields — required explicit reasoning about node order. The graph structure enforces it.
Instrumentation is not optional
LangSmith tracing immediately showed that node_diagnose and node_evaluate have the highest latency (2–4s each). Without tracing, this would have required manual timing instrumentation.
Sandbox design shapes what you can fix
The 5% failure rate is entirely explained by the sandbox design choice: no pip access. Understanding where your agent fails — and whether those are agent bugs or environmental constraints — requires structured logging.
Typos in dict keys are silent
No error. No stack trace. Just the wrong branch taken silently. The only defenses are code review, consistent conventions, and integration tests that exercise every path.
Appendix: Run Locally
```bash
git clone https://github.com/anudeepreddy332/code-agent.git
cd code-agent
uv sync
uv run python main.py scripts/broken_scripts/attribute_error.py
```
Core files: graph.py, nodes.py, tools.py, main.py
Repository: https://github.com/anudeepreddy332/code-agent