[Header image: worn desk with notebooks, an open laptop showing code, and a cup of tea under cool London window light]

Why I Started Writing Tests Again (AI Made Me)

Back in early 2022, I made a quiet decision: I stopped writing unit tests for most of the WordPress and Node projects coming through Seahawk. Not loudly. No blog post about it. I just... stopped. The justification was sound, I thought — we were shipping 15 to 20 client sites a month, I had three other developers on rotation, and the tests I was writing felt like documentation nobody read. Manual QA caught the real bugs. Tests were theatre.

Fast-forward to late 2023. GitHub Copilot had been in my editor for about eight months. I'd also started using Cursor on the side for anything greenfield. The speed was genuinely remarkable. But something started happening. Bugs were appearing in places I hadn't touched. Logic that looked correct was wrong in edge cases I'd never thought to check. And the worst part — the AI had no idea it was wrong. It wrote the broken code with the same confident indentation it always does.

That's when I picked the tests back up.

---

The Period I Dropped Testing (And Why It Made Sense At The Time)

Honest answer? For a certain type of project, skipping tests was the right call. If you're building a five-page brochure site in WordPress, writing PHPUnit tests for a contact form plugin is theatre. I stand by that.

Seahawk's bread and butter for a long time was exactly that work — high-volume, relatively low-complexity, well-defined scope. A client hands you a Figma file, you build it, you QA it, you ship it. The feedback loops were short. If something broke, you'd know within hours. Writing tests for that context is the developer equivalent of laminating a Post-it note.

But I generalised that lesson too aggressively. I started treating all projects like brochure sites. Even the ones with custom WooCommerce checkout flows. Even the fintech dashboard we built in early 2023 for a client in Frankfurt — full custom REST API, JWT auth, three different user permission tiers. No tests. Just "careful manual QA." That was arrogant, and it bit us.

The Frankfurt project shipped with a permissions bug that let editor-level users query admin-level data under a specific combination of filters. We didn't catch it until their internal team ran a security review six weeks post-launch. Embarrassing. Fixable. But the kind of thing a basic integration test would have flagged before we even raised a pull request.
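Something along these lines would have done it. This is a hypothetical sketch, not the real project's code; the module, role names, and filter fields are all invented for illustration:

```ts
import { describe, it, expect } from "vitest";
import { runReportQuery } from "./reports"; // assumed module under test (hypothetical)

describe("report query permissions", () => {
  it("never returns admin-scoped rows to an editor, whatever the filters", async () => {
    const editor = { id: 42, role: "editor" as const };

    // The real bug only surfaced under a specific filter combination,
    // so the test sweeps combinations instead of one happy path.
    const filterSets = [
      { includeArchived: true, groupBy: "account" },
      { includeArchived: true, groupBy: "account", expand: "billing" },
    ];

    for (const filters of filterSets) {
      const rows = await runReportQuery(editor, filters);
      expect(rows.every((r) => r.visibility !== "admin")).toBe(true);
    }
  });
});
```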

---

What AI Coding Tools Actually Changed

Here's the thing most people miss when they talk about Copilot or Cursor or whatever model is hot this month: the code looks right. That's the problem.

When a junior developer writes buggy code, you can often see the uncertainty in it. Odd variable names, a comment that says // not sure about this, a function that's clearly copy-pasted twice. The code telegraphs its own fragility. AI code doesn't. It's stylistically consistent, well-named, and structured in a way that reads as intentional. The confidence is entirely cosmetic.

Studies from Stanford's Human-Computer Interaction group have flagged that developers using AI assistants tend to over-trust generated code on first read. That tracks with my own experience. I would glance at a 40-line function Copilot had written, think "yeah, that's basically what I'd have written," and move on. Sometimes it was fine. Sometimes it had silently misunderstood what I actually needed.

The specific failure mode I kept hitting: conditional logic around edge cases the AI had no reason to anticipate. It would write a function that handled the happy path perfectly and then quietly fail on null inputs, empty arrays, or non-standard date formats. Things that would have taken me thirty seconds to think about if I'd written the code myself, because I'd have been thinking as I typed.
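A contrived sketch of the pattern (the functions are hypothetical, but this is the exact shape of what I kept finding):

```ts
// Clean, confident, wrong at the edges.
type Order = { total: number; placedAt: string };

export function averageOrderValue(orders: Order[]): number {
  const sum = orders.reduce((acc, o) => acc + o.total, 0);
  return sum / orders.length; // NaN when orders is empty -- never handled
}

export function ordersInJanuary(orders: Order[]): Order[] {
  // new Date("31/01/2024") is Invalid Date in V8, so DD/MM/YYYY input
  // silently filters everything out instead of failing loudly.
  return orders.filter((o) => new Date(o.placedAt).getMonth() === 0);
}
```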

The Speed Trap

There's a real productivity trap here. The AI makes you fast. Fast feels like good. You start shipping faster and you start reviewing less carefully because the velocity feels like evidence of quality. It isn't. Speed and correctness are not correlated when you're prompting a language model.

I put roughly 40% more features into a client project last September than I'd have managed without AI assistance. The project also had more post-launch bugs than anything I'd shipped in two years. Not catastrophic bugs. But annoying ones. The kind that erode client trust.

---

Why Tests Work Differently Now (With AI In The Loop)

When I came back to testing, I didn't come back to the old workflow. Tests first, then implementation, then a slow human review of the AI-assisted code — that's the loop I've settled into now.

The interesting thing is that AI is actually excellent at writing tests, in a way it isn't always excellent at writing application logic. Give Copilot a well-defined function signature and ask it to generate a test suite and it'll produce edge-case coverage I'd have taken twenty minutes to write manually. It imagines unhappy paths well when the task is specifically "find ways this can break."

So I've sort of inverted the thing. I write the test spec. AI fills out the test cases. Then AI writes the implementation. Then I read the implementation through the lens of those tests, rather than just reading the code cold.
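To make that concrete, here's roughly what the generated cases look like for the hypothetical averageOrderValue sketch from earlier. The spec line is mine; the cases are the sort of thing Cursor drafts and I then prune:

```ts
import { describe, it, expect } from "vitest";
import { averageOrderValue } from "./orders"; // hypothetical module from the earlier sketch

// Spec (mine): average of order totals; empty input must return 0, not NaN.
describe("averageOrderValue", () => {
  it("averages totals on the happy path", () => {
    const orders = [
      { total: 10, placedAt: "2024-01-03" },
      { total: 20, placedAt: "2024-01-04" },
    ];
    expect(averageOrderValue(orders)).toBe(15);
  });

  it("returns 0 for an empty order list", () => {
    expect(averageOrderValue([])).toBe(0);
  });

  it("handles a single order without dividing strangely", () => {
    expect(averageOrderValue([{ total: 99.99, placedAt: "2024-02-01" }])).toBeCloseTo(99.99);
  });
});
```

Against the earlier buggy sketch, the empty-list case fails immediately. That's the point of the loop.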

It's slower than pure vibe-coding. But it's faster than the old write-everything-manually-including-tests workflow. And we haven't shipped a permissions bug since Frankfurt.

The Tools I'm Actually Using

  • [Vitest](https://vitest.dev) for anything JavaScript or TypeScript. Replaced Jest for me entirely last year — the config is saner (see the sketch after this list) and the watch mode is quick.
  • PHPUnit still, for WordPress and custom PHP work. Nothing has replaced it.
  • Cursor's "test this function" shortcut — genuinely one of the most useful single features in any editor I've used.
  • GitHub Actions for CI. Tests run on every push to main. Takes about 90 seconds on most projects.
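On the Vitest point, "saner config" mostly means there is barely any. A minimal vitest.config.ts on my projects is close to this (paths are illustrative):

```ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    environment: "node",
    include: ["src/**/*.test.ts"],
  },
});
```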

---

The Argument Against Tests (Steel-Manned)

I want to give this position a fair hearing because I held it for nearly two years.

The real argument isn't "tests are useless." It's "tests have a cost and many projects don't justify that cost." Writing and maintaining a test suite takes time. On a project with a short lifespan — a campaign microsite, a marketing landing page, a hackathon prototype — that time investment has zero return. The project will be dead before the tests save you anything.

And there's a subtler point: bad tests are worse than no tests. A test suite that passes because the tests are tautological (you're essentially testing that your function returns what you told it to return) gives you false confidence. I've seen this at agencies. Developers writing tests that always pass because nobody challenged what they were actually verifying.
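Here's the shape I mean, with a hypothetical cartTotal. The first test re-derives the implementation's own formula, so it can never disagree with it; the second states the expected value independently:

```ts
import { it, expect } from "vitest";
import { cartTotal } from "./cart"; // hypothetical function under test

// Tautological: the expectation repeats the formula, so a logic bug can't fail it.
it("calculates the total (hollow)", () => {
  const items = [{ price: 10, qty: 2 }, { price: 5, qty: 1 }];
  expect(cartTotal(items)).toBe(items.reduce((s, i) => s + i.price * i.qty, 0));
});

// Actually verifying behaviour: the expected value is stated by hand.
it("calculates the total (real)", () => {
  expect(cartTotal([{ price: 10, qty: 2 }, { price: 5, qty: 1 }])).toBe(25);
});
```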

Martin Fowler has written well about this — coverage percentages are not a measure of test quality. A 90% coverage number can mask a completely hollow suite.

So: don't test everything. Don't test because it feels professional. Test because you've identified logic that is load-bearing and would be expensive to break.

---

What I Test Now (And What I Don't)

Here's where I've actually landed over the last eight or nine months:

I test:

  1. Any function that handles money, permissions, or data transformation
  2. Any API endpoint that isn't a straight CRUD pass-through
  3. Custom business logic where the client has specified exact behaviour in writing
  4. Anything an AI wrote that I didn't fully read line-by-line

I don't test:

  • UI rendering (Snapshot tests have never saved me once in nine years. Not once.)
  • Third-party API wrappers where the external behaviour is out of my control
  • One-off scripts that run once and get deleted
  • Standard WordPress hooks unless they're doing something unusual

That's it. No grand philosophy. Just a list based on where I've been burned.

---

The Workflow That Actually Works For Me

Since a few people have asked in Slack communities I'm in, here's the actual sequence:

  1. Write a brief spec comment at the top of the file — what this module does, what it doesn't do, edge cases I already know about.
  2. Ask Cursor to generate test cases from that comment before writing any implementation.
  3. Review those test cases. Delete the stupid ones. Add any the AI missed.
  4. Let Copilot or Cursor write the implementation.
  5. Run the tests. They will fail. Fix the implementation (not the tests).
  6. Read the diff before pushing — AI-assisted code still needs a human pass.

Step 6 is non-negotiable. I've caught three genuinely bad bugs in the last four months just by reading the diff slowly before pushing. Nothing clever. Just reading.
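And since step 1 raises the obvious question of what "a brief spec comment" looks like: nothing elaborate. A hypothetical example of the shape I write:

```ts
/**
 * billing/proration.ts
 *
 * Calculates the prorated charge when a client changes plan mid-cycle.
 * Does NOT issue refunds -- downgrades that result in credit go through
 * the refunds module instead.
 *
 * Known edge cases to cover:
 *  - plan change on the last day of the cycle
 *  - cycles that cross a month boundary (28/30/31-day months)
 *  - zero-cost trial plans (charge must be 0, never negative)
 */
```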

Kent Beck's original framing of TDD was never about 100% coverage or perfect methodology. It was about building a feedback loop fast enough to catch mistakes before they compound. That idea — fast feedback loops — is more relevant now than it was in 2003. Because the AI makes mistakes faster than any developer I've ever hired.

---

FAQ

Does this slow down your delivery speed?

By about 10 to 15% on complex projects. On simple ones, maybe not at all — AI generates the tests so quickly that the overhead is minimal. For projects where a bug would cost real money to fix post-launch (and most real-money projects qualify), that 15% is worth it a hundred times over.

What about TypeScript? Doesn't strong typing replace a lot of tests?

Partially. TypeScript catches a whole class of errors at compile time that you used to need tests for. But types don't test business logic. They don't verify that your discount calculation function applies the right rules for wholesale customers. That's still on you.
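A small illustration of the gap, with a hypothetical discount rule. The function below type-checks cleanly and is still wrong; only the test pins the actual business rule down:

```ts
import { it, expect } from "vitest";

type Tier = "retail" | "wholesale";

// Fully typed, compiles cleanly -- and quietly gives everyone the retail rate.
function applyCustomerDiscount(tier: Tier, subtotal: number): number {
  return subtotal * 0.9; // 10% off, regardless of tier
}

it("gives wholesale customers the 15% rate", () => {
  expect(applyCustomerDiscount("wholesale", 200)).toBe(170); // fails: returns 180
});
```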

Should junior developers use AI coding tools if they're not writing tests?

No. Strong opinion. A junior developer using Copilot without tests is essentially flying a plane on autopilot without understanding how the autopilot works or how to land manually. The AI will produce code that looks senior-level, the junior won't know which parts to distrust, and you'll have a production incident eventually. Tests at least give them a mechanism for verifying the output they're accepting.

Why did you stop testing in the first place, honestly?

Burnout, partially. And a period where every project was genuinely simple and tests genuinely weren't earning their keep. The mistake was not noticing when the project complexity changed and adjusting accordingly. That's the real lesson — not "always test" or "never test" but knowing which category a given project falls into.

---

Writing tests didn't use to feel like protection. It felt like paperwork. The AI changed that. Not because the AI is bad — it's made me meaningfully faster — but because it introduced a new class of confident, well-formatted, plausible-looking mistakes that I can't catch by reading code the way I used to. The tests aren't for the AI. They're for me. A forcing function to think about what I actually need the code to do before I accept whatever the model hands me.

I wish I'd framed it that way two years ago.
