
Agentic Development: Letting Claude Drive the Whole Loop

Last October I handed a fairly boring task to Claude: "Scaffold a WordPress REST API wrapper in PHP, write PHPUnit tests for it, run them, fix whatever fails." I gave it access to my terminal via a local Claude tool-use setup and walked away to make tea. Came back twelve minutes later. Tests were green. I had a working wrapper with 94% coverage and a small inline comment where Claude had caught an edge case I hadn't mentioned in the brief. I stood there in my kitchen in Bermondsey genuinely unsettled.

That's agentic development. Not autocomplete, not a smarter Stack Overflow. A model that reasons about a goal, picks the next action, executes it, observes the result, and loops until done. And it's changing how I run projects at Seahawk Media faster than almost anything in the last nine years.

What "Agentic" Actually Means (And What It Doesn't)

Let's be precise, because this word gets thrown around loosely. An agentic AI loop has three things: a goal, a set of tools, and the ability to decide what to do next based on what just happened. The model isn't just generating text. It's acting, observing, and replanning.
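
In code, that loop is smaller than people expect. Here's a minimal sketch using the Anthropic Python SDK: the run_bash tool name, its deliberately naive executor, and the 30-step cap are my own choices, not anything Anthropic ships, and the model ID is just the one I happened to be using.

    import subprocess
    import anthropic

    client = anthropic.Anthropic()            # reads ANTHROPIC_API_KEY from the environment

    def run_bash(command: str) -> str:
        # Naive executor for the sketch; a guarded version appears further down the post
        done = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=300)
        return (done.stdout + done.stderr)[-4000:]

    tools = [{
        "name": "run_bash",                   # hypothetical tool name
        "description": "Run a shell command in the project directory and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }]

    messages = [{"role": "user", "content":
        "Goal: make ./vendor/bin/phpunit tests/ pass. Run it, read the failures, "
        "fix the code, re-run. Stop and explain if you get stuck."}]

    for _ in range(30):                       # hard cap on steps, because every loop needs an exit
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            break                             # the model has finished (or given up) in plain text

        # Act on each tool request, then feed the observations back in
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = run_bash(block.input["command"])
                results.append({"type": "tool_result",
                                "tool_use_id": block.id,
                                "content": output})
        messages.append({"role": "user", "content": results})

Goal, tools, act, observe, repeat. Everything else in this post is about making that loop safe and worth running.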

What it's not is magic. The model can still hallucinate a function signature. It can loop itself into a corner doing the same wrong fix seven times. It can misunderstand your goal at step one and build confidently in the wrong direction for ten steps. I've seen all of these. Once, on a React dashboard project, Claude spent about twenty minutes adding increasingly baroque null checks to solve a problem that was actually a missing await. That one was my fault, I gave it a vague initial spec.

The distinction that matters for practitioners: narrow agentic tasks beat open-ended ones. "Write and test a slug sanitisation function that handles Arabic, Japanese, and emoji" is a great agentic task. "Build me a SaaS" is not. Scope it tight, or you'll spend more time reviewing wrong turns than you would have spent just writing the code.

The Actual Stack I'm Using

Tools matter enormously here. Without the right scaffolding, "agentic Claude" is just a chat window.

My current setup at Seahawk:

  • [Claude API](https://www.anthropic.com/api) with tool-use, specifically the computer_use beta and custom bash/filesystem tools
  • Cursor as the IDE layer, with Claude 3.5 Sonnet set as the backend model
  • pytest / PHPUnit / Jest depending on the project, because Claude needs a deterministic signal to loop on. Without test output, it's flying blind.
  • A short system prompt that tells Claude what the project structure is, what the coding standards are, and, this is important, to stop and ask if it's about to create a new file outside the specified directory.

That last constraint sounds minor. It isn't. Agentic models will happily scaffold entire new modules if they think it serves the goal. Guardrails on the file system have saved me from several "where did this come from?" moments.
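
If you're writing your own tool executor, the guard is only a few lines. A sketch of the check I mean, with the path and function name standing in for whatever you've scoped the task to:

    from pathlib import Path

    ALLOWED_ROOT = Path("/srv/project/src").resolve()    # the only tree the agent may write into (stand-in path)

    def safe_write(path: str, content: str) -> str:
        target = Path(path).resolve()                    # collapses ../ escapes and symlinked segments
        if ALLOWED_ROOT not in target.parents:
            # This string goes back to the model as the tool result, so it knows to stop and ask
            return f"REFUSED: {path} is outside {ALLOWED_ROOT}. Stop and ask before creating files elsewhere."
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        return f"Wrote {len(content)} bytes to {target}"

The refusal message matters as much as the refusal: the model reads it as the tool output and, combined with the system prompt, treats it as a cue to pause rather than route around it.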

One thing I don't use: multi-agent orchestration frameworks for most work. LangChain, AutoGen, CrewAI, they're genuinely interesting, but for solo developer or small-agency use, the overhead of configuring agents-talking-to-agents usually isn't worth it. One well-scoped Claude loop beats three poorly scoped agents yelling at each other.

How I Structure a Loop-Worthy Task

Here's what I've settled on after probably 200+ agentic sessions this year. Give the model a task brief that has four things in it:

  1. The goal: specific, testable, small enough to finish in under 30 steps
  2. The starting state: what files exist, what tests already pass
  3. The success condition: usually "all tests green" or a specific function signature it must produce
  4. The stop condition: "if you've tried the same fix more than three times, stop and explain why you're stuck"

That fourth one is underrated. Without it, Claude will sometimes loop forever. Not forever-forever, but I've watched it make 18 attempts on a gnarly regex problem, each slightly different, none correct, never stopping to say "I'm not sure." Anthropic's own guidance talks about Claude acknowledging uncertainty rather than bluffing, but in practice you still need to prompt for that behaviour explicitly.
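
For concreteness, here's roughly the shape of one of those briefs. The paths, file names, and the SlugsTest class are stand-ins, not real project structure:

    GOAL: Add a sanitise_slug() helper in src/utils/slugs.php that handles Arabic, Japanese,
      and emoji input, with PHPUnit coverage for each case.
    STARTING STATE: src/utils/ and tests/Utils/ exist; ./vendor/bin/phpunit tests/ is currently green.
    SUCCESS: ./vendor/bin/phpunit tests/ passes, including the new SlugsTest cases.
    STOP: If the same test still fails after three different fixes, stop and summarise what you tried
      and why you think it's failing. Do not create or modify files outside src/utils/ and tests/Utils/.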

Back in 2022 a client gave us a job migrating 14,000 product listings from a legacy Magento install to WooCommerce. At that point Claude wasn't doing agentic loops, so we wrote the migration scripts by hand over two weeks. Same job today? I'd write a tight spec, hand it to Claude with read access to the Magento DB schema and write access to a staging WooCommerce instance, and let it run. I genuinely think we'd be done in two days. That's the delta.

Where Claude Is Surprisingly Good

Refactoring Existing Code

This is where I've been most impressed. Give it a messy 400-line class, tell it to refactor toward single responsibility, and let it run its own unit tests as checkpoints. It holds context across the whole file better than I expected and it's genuinely careful about not breaking passing tests. The output isn't always the architecture I would choose, but it's usually defensible.

Writing Tests for Code You Didn't Write

Seahawk does a lot of site audits and rescues. We inherit codebases all the time, often with zero tests and no original developer in sight. I've started using agentic Claude specifically to write a test suite for inherited code before we touch anything. It reads the source, infers intent from function names and comments, writes tests, runs them, and adjusts when something fails unexpectedly. Last month it caught a silent data corruption bug in a custom WooCommerce order handler that had probably been there for two years. Nobody knew.

Grinding Through Boilerplate

REST endpoint scaffolding, CRUD migrations, admin panel forms. The boring stuff that takes a competent developer an afternoon and nobody enjoys. Claude is fast and consistent here, and consistency is actually what you want in boilerplate: it doesn't get creative, it doesn't get tired, it just pattern-matches your existing code and extends it.

Where It Goes Wrong

Honestly, the failures are instructive. Here are the ones I hit most often:

  • Context window overflow on large codebases. Claude 3.5 Sonnet has a 200k token context window, which sounds enormous until you're feeding it a full WP plugin with 40 files. It starts forgetting things it saw early in the session. Solution: break the job into smaller loops with explicit checkpoints.
  • Confidence on things it shouldn't be confident about. Claude will fix a database query, run the test, it passes, and report success, but the query is now subtly less efficient because it swapped an index-friendly WHERE clause for a subquery. It solved the stated problem and created an unstated one. Code review still matters.
  • Tool permission creep. If you give it bash access and don't constrain it, it will run npm install for packages you didn't ask for, or worse, make a network request you didn't sanction. This isn't malicious, it's the model doing what seems helpful. Set your tool permissions before you start, not after something weird happens.
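
The fix for that last one is mechanical. A sketch of a constrained bash executor, with the allowlist obviously a stand-in for your own (and no illusions that a command allowlist catches every trick; it just stops the casual npm install and curl):

    import shlex
    import subprocess

    ALLOWED_COMMANDS = {"php", "composer", "ls", "cat", "grep", "./vendor/bin/phpunit"}   # stand-in allowlist

    def run_bash(command: str) -> str:
        # Only run commands whose first token is explicitly allowed; everything else bounces back to the model
        first = shlex.split(command)[0] if command.strip() else ""
        if first not in ALLOWED_COMMANDS:
            return f"REFUSED: '{first}' is not on the allowlist. Explain why you need it and wait."
        result = subprocess.run(shlex.split(command), capture_output=True, text=True, timeout=120)
        return (result.stdout + result.stderr)[-4000:]    # the tail is usually enough signal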

A note on security: if you're running agentic loops against anything connected to production data, please read Anthropic's own guidance on tool-use safety. It's not long, and it will save you a bad day.

Prompting for Agentic Behaviour vs. Chat Behaviour

The prompting model is different and this trips people up. In a chat context you're conversational, iterative, back-and-forth. In an agentic context the initial prompt is a specification document. You won't be there to clarify mid-task.

Things that make agentic prompts work:

  1. State the constraints first, not last. Most people bury them.
  2. Tell it what not to do. "Do not modify any file outside /src/utils" is more useful than ten lines of positive instructions.
  3. Give it an escape hatch. "If you reach a decision point where proceeding would require changing the database schema, stop and write a summary of why."
  4. Reference the test runner command explicitly. "Run ./vendor/bin/phpunit tests/ after every change and use the output to guide your next step."
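
On that last point, I've had better results exposing the test runner as its own tool rather than letting the model improvise shell commands, because the output format stays stable across the whole loop. A sketch, with the command and names mine:

    import subprocess

    def run_tests(_: dict) -> str:
        # Tool handler: run the project's PHPUnit suite and return a compact, deterministic summary
        result = subprocess.run(
            ["./vendor/bin/phpunit", "tests/"],
            capture_output=True, text=True, timeout=300,
        )
        status = "ALL GREEN" if result.returncode == 0 else "FAILING"
        # Put the status line first so the model can't miss it, then the tail of the raw output
        return f"{status} (exit {result.returncode})\n{result.stdout[-3000:]}"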

The framing shift is this: you're writing a brief for a contractor who is very capable but can't ask questions. So write it that way.

The Autonomy Dial: How Much to Let It Run

This is the question I get from other agency owners most. Do you babysit it or let it go?

My answer after a year of this: it depends entirely on the reversibility of the actions. Read-only research, test writing, scaffolding into a new directory, let it run. Anything that touches a live database, modifies package manifests, or interacts with external APIs, check in every few steps, or at least read the plan before it executes.

The ReAct prompting pattern (Reason + Act, from the 2022 Yao et al. paper) is worth understanding here. It's essentially what Claude does internally when you give it tools: it thinks out loud about what to do, does it, reads the result, thinks again. Making that reasoning visible, asking Claude to print its plan before each action, gives you a natural review point without breaking the loop.

I've started treating Claude's step-by-step reasoning output the way I treat a junior developer's PR. I skim it. If something looks off, I intervene. If it looks reasonable, I let it proceed. That mental model has served me well.
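
That review point is easy to build into the loop itself. A sketch: print whatever reasoning text Claude emits alongside its tool calls, and pause for a keypress before anything irreversible runs. The RISKY_TOOLS set and the executors mapping are my own convention, not an SDK feature:

    RISKY_TOOLS = {"run_bash", "write_file", "http_request"}   # anything that changes state outside the sandbox

    def handle_tool_calls(response, executors):
        results = []
        for block in response.content:
            if block.type == "text":
                print(f"[claude] {block.text}")                 # the model's stated plan / reasoning
            elif block.type == "tool_use":
                if block.name in RISKY_TOOLS:
                    if input(f"Run {block.name} with {block.input}? [y/N] ").lower() != "y":
                        results.append({"type": "tool_result", "tool_use_id": block.id,
                                        "content": "Human declined this action. Propose an alternative."})
                        continue
                output = executors[block.name](block.input)
                results.append({"type": "tool_result", "tool_use_id": block.id, "content": output})
        return results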

FAQ

Is agentic Claude actually production-ready for client work?

For specific, bounded tasks on non-production environments: yes, absolutely. I use it regularly for the scaffolding, refactoring, and test-writing phases of projects. For anything touching a live client database or external payment API, I keep a human in the loop at every execution step. The model is capable; the risk is in the blast radius of a mistake, not the model itself.

What's the difference between agentic Claude and just using Cursor or GitHub Copilot?

Cursor and Copilot are inline code suggestions and chat interfaces. They react to what you type. Agentic Claude takes a goal and executes a multi-step plan on its own, using tools like a terminal, file system, or web browser. It's the difference between an autocomplete engine and a process that can run unsupervised for ten minutes and come back with a completed task.

Do I need to know how to code to use this?

You need enough context to write a coherent spec and to review the output critically. If you can't read a diff and tell whether the change makes sense, you're going to have a bad time. Agentic AI amplifies competence. It doesn't replace the baseline.

Which Claude model should I use for agentic loops?

Claude 3.5 Sonnet is my current default. It hits a good balance between reasoning quality and speed, which matters when you're paying per token across a 30-step loop. Claude 3 Opus is better at very complex reasoning tasks but slower and more expensive, I use it for the initial planning step on large jobs, then hand off to Sonnet for execution.
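
In practice that handoff is just two calls with different model strings. A sketch, assuming the client and task brief from the loop further up, with model IDs that were current when I wrote this (check the docs before copying them):

    plan = client.messages.create(
        model="claude-3-opus-20240229",        # slower, stronger reasoning: produce the plan only
        max_tokens=2048,
        messages=[{"role": "user", "content": brief + "\n\nProduce a numbered execution plan. Do not write code yet."}],
    ).content[0].text

    # Then run the normal tool-use loop on the cheaper, faster model with the plan prepended
    messages = [{"role": "user", "content": f"{brief}\n\nFollow this plan:\n{plan}"}]
    # ... loop with model="claude-3-5-sonnet-20240620" exactly as in the earlier sketch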

---

The thing I keep coming back to is that agentic development isn't really about AI replacing developers. It's about changing what a developer's time is worth. The twelve minutes I didn't spend writing that PHP wrapper last October, I spent thinking about architecture. That's a trade I'll take every time.
