John Haigh's Blog: The Anatomy Of An AI Coding Agent, Part 6

# The Anatomy Of An AI Coding Agent, Part 6

## The Feedback Loop: Verification, Evaluation, And Learning

AI coding agents do not become useful because they can generate code. They become useful when they can recover from being wrong.

That distinction matters. A code suggestion tool can autocomplete a function. An agent is expected to take a goal, inspect a codebase, make changes, run checks, interpret failures, adjust its approach, and keep moving.

The difference is not raw generation. It is the feedback loop: verification, evaluation, and learning.

For software engineers and technical leaders evaluating tools like Cursor, Claude Code, Codex CLI, and similar systems, this loop is where much of the practical value lives. It is also where many failures hide.

## Verification: Did The Change Actually Work?

The first layer of feedback is verification. This is the ordinary engineering question: did the thing we changed behave correctly?

For an AI coding agent, verification usually means using the same signals a human engineer would use:

- Unit tests.

- Integration tests.

- Type checks.

- Linters.

- Build steps.

- Runtime errors.

- Browser checks.

- API responses.

- Logs.

- Diff review.

A weak agent treats code generation as the endpoint. A stronger agent treats generation as a hypothesis. It proposes a change, then looks for evidence.

For example, imagine asking an agent to fix a bug where a React form submits twice when the user presses Enter. A shallow agent may add a debounce and stop. A better agent will inspect the form handler, notice both `onSubmit` and `onKeyDown` paths trigger the same action, remove the duplicate path, and run the relevant test suite.

If tests fail because an existing assertion expected the old event behavior, the agent should determine whether the assertion is now wrong or whether the implementation broke something else.

Running tests is not enough. The agent has to understand what the result means.

Verification should also be scoped. Running every test in a monorepo after changing one helper function may be expensive. Running only a single unit test after modifying shared authentication middleware may be dangerously narrow. Good agents develop a sense of blast radius: what changed, what depends on it, and what evidence is proportional.

## Evaluation: Was This The Right Change?

Verification asks whether the change works. Evaluation asks whether it was a good change.

This is where agents need judgment, not just tooling. Code can pass tests and still be wrong for the system.

Consider a request:

```text

Make the export job faster.

```

An agent might parallelize database reads and pass the test suite. But evaluation should ask deeper questions:

- Does this overload the database?

- Does it preserve ordering guarantees?

- Are there rate limits or tenant isolation constraints?

- Does the existing system already have a queue or batching abstraction?

- Is the performance gain measured or assumed?

A practical agent should be able to say: "The change is likely correct, but I do not see a benchmark or production-like test covering the intended performance improvement."

That kind of answer is valuable because it separates confidence from evidence.

Evaluation also includes maintainability. If an agent solves a problem by adding a clever abstraction that no one asked for, the change may be technically valid and still undesirable. In mature codebases, the best solution is often the one that fits the local style.

If a Go service consistently handles validation through a central `Validate()` method, an agent should not introduce a new validation library for one endpoint. If a frontend app uses React Query everywhere, the agent should not hand-roll `fetch` state for a new screen.

Evaluation is partly about taste, but not arbitrary taste. It is the discipline of respecting the system already in front of you.

## Learning: What Carries Forward?

The word learning can be misleading. Most coding agents do not continuously retrain themselves on your codebase after every task. They do, however, learn within a session through context.

They learn that a test failed because a fixture was incomplete. They learn that a repository uses generated mocks. They learn that a package has strict lint rules. They learn that the user prefers small, reviewable diffs. They learn that a migration tool must be run after changing a schema.

This short-term learning is powerful when the agent uses it well.

Suppose an agent adds a new API field, then sees a failing test because generated OpenAPI types are stale. A poor agent might manually patch the generated file. A better agent will infer the proper workflow: update the source schema, run the generator, then verify the generated output. If a similar issue appears later in the same task, the agent should not rediscover the process from scratch.

Teams can also create longer-term learning through rules, documentation, examples, and review feedback. This is less glamorous than model training, but often more effective.

A short `CONTRIBUTING.md` that explains how to run focused tests may improve agent behavior more than a vague instruction to "write high-quality code."

The best guidance is concrete:

```text

Use make test-unit for backend-only changes.

```

```text

Do not edit generated files directly. Update the schema and run codegen.

```

```text

New authorization checks require tests for admin, editor, and viewer roles.

```

These instructions turn tribal knowledge into usable feedback.

## The Human Role In The Loop

A good feedback loop does not remove humans. It changes where humans spend attention.

Instead of writing every line, the engineer reviews intent, constraints, and evidence. Did the agent understand the request? Did it inspect the right files? Did it run the right checks? Did it explain residual risk honestly?

Technical leaders should evaluate agents by watching this behavior, not just by comparing demos. A tool that produces impressive first drafts but cannot interpret failures will slow down senior engineers. A tool that makes smaller changes, verifies them carefully, and reports uncertainty clearly may be more valuable in real production work.

One useful evaluation exercise is to give agents tasks with known traps:

- A bug with an obvious but incorrect fix.

- A test failure caused by stale generated code.

- A change that requires updating documentation and types.

- A security-sensitive path where passing tests are not enough.

- A flaky integration test that should not be blindly fixed by weakening assertions.

The question is not whether the agent gets everything right immediately. The question is whether it responds intelligently to feedback.

## Failure Is Information

Imagine an agent changes a billing calculation from rounding each line item to rounding only the final invoice total. It updates the implementation and runs tests. One test fails:

```text

Expected: $10.02

Actual: $10.01

```

A generation-only tool might change the expected value and move on. A feedback-driven agent should pause. Is the test documenting the old bug, or is the new behavior wrong? It should inspect the product requirement, nearby tests, and comments. It might discover that tax rules require rounding per jurisdiction, not per invoice. The failing test was protecting a real constraint.

The feedback loop prevented a regression.

This is the core pattern. Failure is not noise. Failure is information.

## Conclusion

AI coding agents will keep getting better at writing code. But for serious engineering work, code generation is only one part of the system.

The real anatomy of a useful agent includes a loop: make a change, verify it, evaluate the result, learn from the feedback, and adjust. That loop is what turns an agent from a fast typist into a practical collaborator.

For teams adopting these tools, the goal should not be blind automation. It should be evidence-driven assistance. The agent should show its work, use the project's existing signals, respect local conventions, and be honest about what remains uncertain.

Trust does not come from confident output. It comes from a process that can find mistakes before they reach production.

John Haigh's Blog

Thursday, May 14, 2026

The Anatomy Of An AI Coding Agent, Part 6

No comments:

Post a Comment

About Me