In AI coding, tests become the actual source code. Agents ‘compile’ programs from tests.
The quality of AI-generated code depends on feedback. Tests. Linting. Evals. Benchmarks. Insights.
Recent advances in AI coding occurred because of stricter harnesses with richer feedback. Harnesses that guide the LLM to produce correct code through tests. It's not just the models - it's our ability to mold plausible code into correct code.
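A minimal sketch of that loop, with the agent stubbed out as a list of hypothetical candidate implementations (the names `run_tests`, `candidates`, and the `slugify` spec are all illustrative; a real harness would call an LLM with the failure messages):

```python
# Sketch of a test-driven harness: run the spec, collect failures as
# feedback, retry until a candidate passes. The "agent" is stubbed as
# a list of candidate implementations.

def run_tests(impl):
    """Return a list of failure messages; empty means all pass."""
    failures = []
    cases = [(("hello world",), "hello-world"), (("A  B",), "a-b")]
    for args, expected in cases:
        got = impl(*args)
        if got != expected:
            failures.append(f"slugify{args} -> {got!r}, expected {expected!r}")
    return failures

# Hypothetical agent outputs: plausible, but not all correct.
candidates = [
    lambda s: s.replace(" ", "-"),         # misses lowercasing and runs of spaces
    lambda s: "-".join(s.lower().split()),  # satisfies the spec
]

feedback = []
for impl in candidates:
    feedback = run_tests(impl)
    if not feedback:
        break  # the tests molded a plausible candidate into a correct one
```

The point isn't the loop itself but where the correctness lives: entirely in `run_tests`, not in any candidate.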

At extreme levels, we don't look at code. We instead have opinions about the output's correctness (i.e., tests). When agents fail, it's because we didn't capture some requirement. We left out a test.
In this situation, the code becomes a compiled binary from tests, a black box. The tests guide the agent to produce the correct executable.
Tests become like declarative code - specifying more of the what than the how. We might not write the code directly, but the tests are what we pay careful attention to. Did the agent have sufficient feedback to produce good code?
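Concretely, a declarative spec looks like this. The hypothetical `parse_price` below is one implementation an agent might produce; the assertions are the part we actually author and care about:

```python
# Tests as declarative spec: they state *what* a hypothetical
# parse_price must do, not *how*. Any implementation that
# satisfies them is acceptable to us.

def parse_price(text):
    """One possible implementation; the tests below are the real spec."""
    return round(float(text.strip().lstrip("$").replace(",", "")), 2)

# The spec: observable behavior only.
assert parse_price("$1,234.50") == 1234.50
assert parse_price("  $7  ") == 7.0
assert parse_price("0.999") == 1.0
```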
What does this mean? We clearly don’t need tests to generate code.
Indeed, AI produces plausible, 90%-correct solutions from prompts. It can capture a default definition of "correct" for most functionality. We don't need to tell it how to build a login screen in Rails, for example. But eventually, we might care about some detail.
In a way, few-shotted AI projects become instant legacy code as defined in the classic Feathers book, Working Effectively with Legacy Code. I don't mean that in a pejorative sense. More that legacy code defines how we work with it. We've inherited it. It's a black box. We hope to keep it at arm's length.
Like any legacy codebase, we might use the strangler pattern to layer in correctness - testing from the outside in. We might break it apart into testable units. We put in the highest-priority tests, gradually moving down to (or never visiting) less crucial functionality.
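The outside-in move here is Feathers' characterization test: before changing anything, pin down what the inherited black box currently does. In this sketch, the hypothetical `legacy_discount` stands in for agent-generated code we didn't write:

```python
# Characterization test: record the black box's observed behavior,
# then assert it, so later refactors can't silently change it.

def legacy_discount(total, member):
    # Opaque generated logic; we treat it as a black box.
    rate = 0.1 if member else 0.0
    if total > 100:
        rate += 0.05
    return round(total * (1 - rate), 2)

# Outside-in: capture current outputs at the boundary first.
observed = {
    (50, False): legacy_discount(50, False),
    (150, True): legacy_discount(150, True),
}

# Then freeze them as the spec for this unit.
assert observed[(50, False)] == 50.0
assert observed[(150, True)] == 127.5
```

Notice the assertions came from observation, not from requirements; they buy us a safety net while we decide which behaviors actually matter.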
It's a kind of top-down engineering that assumes a mess. We work to tame it. Or not, and it's still a fun demo.