There Is More to Software Than Just Code

I’ve been building enterprise software for over twenty years. These days, I rarely write code myself. Most of what I build — web applications backed by databases, serving multiple users, handling authentication, enforcing business rules — is structurally predictable. The components are well-known. Most of the code is glue between the database and the user. That kind of work is predictable enough to delegate.

So I deliver software through AI agents now. Scala 3, ZIO, HTMX. The agents handle analysis, task breakdown, implementation, code review, testing, deployment scripts. My job is knowing what to build, how to structure it, and what “done” means.

What I focus on now is the engineering around the code: architecture, testing, security, operations. Those are what turn generated code into software you can actually maintain and run.

Architecture and AI

Clean architecture is easier to maintain with agents than with human teams.

In my experience, developers readily see the benefit of patterns like domain-driven design or hexagonal architecture. Implementing them well requires discipline — and maintaining discipline across a group of humans is hard. People bring different context, different prior knowledge. They’re under stress and cut corners. They interpret guidelines differently. Consistency erodes over time.

AI agents don’t have that problem. Give them clear rules and boundaries, and they follow them. The code I get back is more consistent than what I’ve seen from teams — including teams I’ve led. They’ll even make reasonable architectural choices on their own — the issue is that without explicit direction, they’ll make different reasonable choices each time. So I choose the approach upfront, and the agents apply it consistently across the project.

I’ve settled on three patterns: domain-driven design, hexagonal architecture, and a functional core with imperative shell. They work particularly well for AI-driven development because they create clear boundaries. The agent knows exactly where domain logic goes, where infrastructure lives, where the interfaces are. Fewer judgment calls means fewer mistakes.
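To make those boundaries concrete, here is a minimal Scala 3 sketch of a functional core, a port, and an imperative shell. The invoice domain and every name in it (Invoice, InvoiceLogic, InvoiceRepository, PaymentService) are invented for illustration and are not taken from any of my projects. It uses only the standard library, with no ZIO, to stay short.

```scala
// Hypothetical domain: none of these names come from a real project.

// Functional core: pure data and pure decisions, no I/O.
final case class Invoice(id: String, amountCents: Long, paid: Boolean)

enum PaymentError:
  case AlreadyPaid
  case WrongAmount(expectedCents: Long, receivedCents: Long)

object InvoiceLogic:
  // Pure function: the same inputs always give the same result.
  def applyPayment(invoice: Invoice, receivedCents: Long): Either[PaymentError, Invoice] =
    if invoice.paid then Left(PaymentError.AlreadyPaid)
    else if receivedCents != invoice.amountCents then
      Left(PaymentError.WrongAmount(invoice.amountCents, receivedCents))
    else Right(invoice.copy(paid = true))

// Port: the domain's view of storage, with no mention of SQL or HTTP.
trait InvoiceRepository:
  def find(id: String): Option[Invoice]
  def save(invoice: Invoice): Unit

// Adapter: an in-memory stand-in. A database-backed adapter would implement the same trait.
final class InMemoryInvoiceRepository extends InvoiceRepository:
  private val store = scala.collection.mutable.Map.empty[String, Invoice]
  def find(id: String): Option[Invoice] = store.get(id)
  def save(invoice: Invoice): Unit = store.update(invoice.id, invoice)

// Imperative shell: load, call the pure core, persist the result.
final class PaymentService(repo: InvoiceRepository):
  def pay(id: String, receivedCents: Long): Either[String, Invoice] =
    repo.find(id) match
      case None => Left(s"invoice $id not found")
      case Some(invoice) =>
        InvoiceLogic.applyPayment(invoice, receivedCents) match
          case Left(error) => Left(error.toString)
          case Right(updated) =>
            repo.save(updated)
            Right(updated)
```

With a layout like this, an agent has only one place to put a business rule (InvoiceLogic) and one place to put a side effect (an adapter or the service), which is the kind of boundary that removes judgment calls.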

And the AI is genuinely good at the design work itself. Prompt it specifically to define bounded contexts, map domain relationships, and propose a model — the results are solid. I adjust based on context it doesn’t have, give feedback where my understanding of the problem differs. The baseline quality is higher than I expected.

Testing is the verification loop

An AI agent without tests is no better than raw inference — guessing from training data the same way a human writes from memory. What makes agents genuinely useful is that they can verify their results. Tests are the tool for that. Write them first, ideally TDD-style, and the agent has something to check against. It sees what it got wrong and fixes it. That feedback loop is what makes it work.

The trap is complacency. AI generates tests fast, and it’s easy to see a wall of green and move on without checking whether the tests actually make sense. A human usually writes tests deliberately, thinking about what each one covers. AI can be prompted to do the same — the problem is volume. It’s a lot of non-production code that needs review.

Agents have a target: a passing test. If they can’t make it pass after a few attempts, they’ll find another way: hollow out the assertion, skip the check with a comment like “not needed in production,” add a TODO and move on. Then they report success. It is a form of cheating: the agent reaches its goal by arguing the problem away, claiming that the assertion is too strict or that the check doesn’t matter in production. The argument ends in a passing test, which is what was asked for.

No single layer catches everything. I define the top-level scenarios — the end-to-end tests that verify how the system actually behaves — and extend them whenever a problem surfaces. Strong instructions push the agent toward more diligent test writing. Automated reviewers scan specifically for tests that can never fail, tests that verify mocks instead of logic, and tests that have been hollowed out.
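As a rough illustration of what the reviewers look for, here are three tests of the same function, written as plain Scala 3 with standard-library assertions rather than a test framework. The function discountedPriceCents and all three tests are hypothetical examples, not code from a real project.

```scala
// Hypothetical function under test.
def discountedPriceCents(priceCents: Long, percent: Int): Long =
  require(percent >= 0 && percent <= 100, "percent must be between 0 and 100")
  priceCents - priceCents * percent / 100

// Hollow: the assertion is a tautology, so it passes for any implementation.
def hollowTest(): Unit =
  val result = discountedPriceCents(10000L, 20)
  assert(result == result)

// Mock-verifying: checks the stub's canned answer and never touches the real logic.
def mockOnlyTest(): Unit =
  val stubbedDiscount: (Long, Int) => Long = (_, _) => 8000L
  assert(stubbedDiscount(10000L, 20) == 8000L)

// Meaningful: pins concrete expected values, the boundaries, and an error case.
def meaningfulTest(): Unit =
  assert(discountedPriceCents(10000L, 20) == 8000L)
  assert(discountedPriceCents(999L, 0) == 999L)
  assert(discountedPriceCents(999L, 100) == 0L)
  assert(scala.util.Try(discountedPriceCents(10000L, 101)).isFailure)
```

All three are green. Only the last one would turn red if the discount calculation were broken, and that difference is invisible in a test report.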

I briefly review the declared intent of each test to check whether it makes sense. Test coverage tools add another signal. Some percentage of tests might still be useless. The system works not because every test is meaningful, but because the layers together catch enough. It’s probabilistic, not deterministic — and I find the results to be reliable.

I’m not striving for perfection, at least not yet. I aim for a result on par with what I’d produce manually, and that seems to hold — I don’t find more unresolved edge cases or missed problems than when I wrote the code myself. There’s always something you forget.

The loop has more than unit tests. It includes integration tests with real databases, end-to-end tests through the UI, and reviewer agents checking the implementation from multiple angles. Each layer provides feedback to the implementing agent. The agent adjusts, the reviewers check again.

The rest becomes affordable

Some concerns that used to be “nice to have, too expensive for a solo developer” are now cheap.

Monitoring endpoints, Prometheus integration, CI pipelines, deployment scripts — necessary to move fast without breaking things, but always first to be sacrificed when time is tight. The CI pipeline breaks. There’s a hotfix to ship, so I build the docker image ad-hoc instead of fixing the pipeline. Next time, same thing. The tooling degrades one shortcut at a time, because the immediate problem is always more important.

The agent writes that code too. Observability, test coverage, infrastructure automation — things that used to compete with the core product for my time are just more tasks for the agent.

I handle security differently. For standardized concerns — authentication, session management, encryption — I use proven tools rather than letting agents improvise them. I delegate authentication to an identity provider through OpenID Connect, and use battle-tested libraries for the rest. These problems have been solved by others. Solving them again, whether by hand or by AI, creates new vulnerabilities.

Application-specific security is harder: authorization rules, data isolation, input validation tied to the domain. There’s no library that covers these — they have to be written for the specific application. I still use established libraries for the building blocks that do exist, and I don’t let the AI improvise security-related code on its own. AI makes the same subtle mistakes humans do, and those mistakes do more damage in security than anywhere else.
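One way to keep such rules reviewable is to write them as small pure functions in the domain layer. The sketch below assumes a multi-tenant application with documents. The types, the "admin" role name, and the rule itself are made up for illustration.

```scala
// Hypothetical types for illustration only.
final case class User(id: String, tenantId: String, roles: Set[String])
final case class Document(id: String, tenantId: String, ownerId: String)

enum Decision:
  case Allow
  case Deny(reason: String)

object DocumentPolicy:
  // Data isolation is checked first, then role or ownership. Deny is the default.
  def canEdit(user: User, doc: Document): Decision =
    if user.tenantId != doc.tenantId then Decision.Deny("different tenant")
    else if user.roles.contains("admin") || user.id == doc.ownerId then Decision.Allow
    else Decision.Deny("not owner or admin")
```

Because the policy has no I/O, every branch can be covered by a handful of plain assertions, and a reviewer can read the whole rule in one place instead of hunting for checks scattered across handlers.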

I review the code, checking that the patterns and libraries are applied properly, but I can’t be confident I would catch the subtle errors. That is why I rely on proven code for security as much as I can.

The dangerous part

Given a goal and a set of tools, agents find creative ways past obstacles. That helps when the goal is the problem in front of them. It worries me when they have access to more than they need.

If an agent can see production database credentials, say because they’re in an environment variable somewhere reachable, it will eventually use them. Not maliciously. It’ll be diagnosing a test failure and decide to verify against production. It might just as easily drop the whole database. These catastrophic mistakes happen, and they’d happen to inexperienced humans too.

I sandbox the agents and only give them what they need for the task. I run them in a restricted environment where they can do whatever they want — and I make sure that environment has no access to anything else without my explicit permission.
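One small piece of that, sketched under assumptions: launching the agent process with an environment reduced to an allowlist, so credentials in the parent shell never reach it. The allowlist contents and the helper names are hypothetical. This covers environment variables only and is not a full sandbox, since filesystem and network isolation need container or OS-level controls.

```scala
import scala.jdk.CollectionConverters.*

// Hypothetical allowlist: only these variables are passed to the agent process.
val allowedEnv: Set[String] = Set("PATH", "HOME", "LANG")

// Pure helper: keep only the allowlisted variables.
def filterEnv(env: Map[String, String], allowed: Set[String]): Map[String, String] =
  env.filter { case (name, _) => allowed.contains(name) }

// Build (but do not start) a process whose environment holds nothing beyond the allowlist.
def restrictedProcess(command: Seq[String], parentEnv: Map[String, String]): ProcessBuilder =
  val builder = new ProcessBuilder(command.asJava)
  val childEnv = builder.environment() // starts as a copy of the current process environment
  childEnv.clear()
  childEnv.putAll(filterEnv(parentEnv, allowedEnv).asJava)
  builder
```

A caller would pass sys.env as parentEnv and then call start() on the result. The allowlist direction matters: a denylist only helps with the secrets you remembered to name.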

The agent is not to be trusted. It needs to operate in a tight verification loop — tests, review, constrained access. Without that, the results are unreliable. It produces excellent results on activities that can be predicted from instructions and previous examples. When the judgment calls are open-ended and there’s no verification, the output is plausible nonsense.

Most of what I build is predictable based on the inputs and prior work. That’s why this works. The agents are very useful when I’m exploring a domain new to me — I can learn through them. I don’t know how they would do in a domain that’s also novel to them, where the patterns aren’t well-represented in their training. I’m not currently working in that territory.

What this means

For most of my career, my constraint has been my own ability to produce code. I always knew exactly what I wanted to build. I had to compromise — sometimes dearly — because there was never enough time to code it all. One more refactoring to make the code cleaner, one more test to cover an edge case — always deferred.

That constraint has been lifted. My ability to produce code is now limited by the quality of my instructions, my orchestration ability, and the size of my token budget — not by how fast I type. And that’s how constraints work — you lift one and the next one reveals itself. The next constraint is making sure the output stays good: establishing solid review practices, maintaining the verification loop, understanding the full territory of software engineering well enough to direct agents through it.

The ability to read code remains immensely useful, as does knowing the tools and patterns available in a given language. Writing code is no longer where I spend my time. Understanding architecture, testing, security, operations — that’s where my attention needs to be now.