
Prompt Engineering for Production Code: Hard Lessons

It was 11:43pm on a Thursday and I was staring at 340 lines of React code that GPT-4 had generated with total confidence. Clean. Well-commented. Completely broken in production. The custom hook was managing state in a way that caused silent re-render loops, the kind that don't throw errors — they just quietly murder your performance until a client rings you on a Friday morning asking why their checkout page takes nine seconds to load.

That night taught me more about prompt engineering than any YouTube tutorial or Twitter thread ever has. And I've had plenty of those nights since.

I've been building on the web for nine years. At Seahawk Media we've shipped well over 12,000 sites — WordPress, headless builds, bespoke React apps, WooCommerce stores handling serious transaction volume. AI coding assistants entered my workflow properly around early 2023, and I've swung between thinking they're miraculous and wanting to throw my laptop into the Thames.

Here's what I've actually learned. The hard way.

---

The Model Doesn't Know Your Codebase. You Have to Tell It.

This sounds obvious. It isn't, not in practice.

The single biggest mistake I see developers make — including myself for the first six months — is treating an LLM like a senior engineer who's already read all your code. You ask "write a function to handle user authentication" and it writes something technically correct in a vacuum. But your project uses Supabase, not Firebase. Your tokens live in httpOnly cookies, not localStorage. Your error format is { status, message, data }, not whatever the model defaulted to.

The model isn't wrong. It just doesn't know you.

Give It a Project Preamble, Every Single Time

I now start every meaningful coding session with what I call a "context block." Takes about 90 seconds to write. Looks something like:

  • Stack: Next.js 14 (App Router), TypeScript, Supabase, Tailwind CSS 3.4
  • State: Zustand, no Redux anywhere
  • Auth: Supabase Auth with httpOnly cookies via middleware
  • Error shape: { success: boolean, error?: string, data?: unknown }
  • Styling convention: utility-first, no custom CSS files unless absolutely necessary

Paste that before any non-trivial request. I do this in Cursor by keeping a _context.md file in the project root. Two keystrokes to paste. The output quality jumps noticeably — fewer assumptions, fewer things I have to rip out.
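One refinement that pays off: pin the error shape down as an actual type in the context block, because pasting a type beats describing one. Mine is roughly this (the name ApiResult is my own invention, not a standard):

```typescript
// The envelope every API route and server action returns.
// The name is illustrative; the shape is the contract that matters.
type ApiResult<T = unknown> = {
  success: boolean;
  error?: string;
  data?: T;
};
```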

---

Specificity Is the Whole Game

Back in 2022, before I was using AI heavily, a client handed me a brief that was literally two sentences: "Build us a booking system. Make it good." We spent three weeks going back and forth on scope. That experience stuck with me, and it directly shapes how I write prompts now.

Vague prompt → vague code. Every time.

"Write a function that fetches orders" will get you something. "Write a TypeScript async function called fetchOrdersByUser that accepts a userId: string, queries the orders table in Supabase where user_id matches and status is not cancelled, orders results by created_at descending, and returns Order[] or throws a typed error" will get you something you can actually ship.

The difference isn't the model's capability. It's the prompt's specificity.
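For illustration, here's roughly what the specific version of that prompt comes back with. A sketch, assuming a supabase-js v2 client exported from lib/supabase and an Order type defined elsewhere (both names are mine):

```typescript
import { supabase } from "@/lib/supabase"; // assumed client export
import type { Order } from "@/types";      // assumed shared type

// Fetch a user's non-cancelled orders, newest first.
// Throws rather than returning null so callers can't silently ignore failures.
export async function fetchOrdersByUser(userId: string): Promise<Order[]> {
  const { data, error } = await supabase
    .from("orders")
    .select("*")
    .eq("user_id", userId)
    .neq("status", "cancelled")
    .order("created_at", { ascending: false });

  if (error) {
    throw new Error(`fetchOrdersByUser failed: ${error.message}`);
  }
  return (data ?? []) as Order[];
}
```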

What to Include in a Code Prompt

  1. Function name and signature — don't let the model invent naming conventions
  2. Input types and output types — TypeScript generics if relevant
  3. The data source — which table, which API endpoint, which cache layer
  4. Edge cases you already know about — "handle the case where the array is empty"
  5. What NOT to do — "don't use useEffect for this, use a server action"

That last one matters more than people realise. Telling the model what to avoid saves enormous time. I've started keeping a small "anti-patterns" note per project — things like "no client components unless user interaction requires it" — and I include relevant lines in prompts for that project.

---

Chain Your Prompts. Don't Ask for Everything at Once.

Seahawk had a fintech client in late 2023 (I won't say who) and we were building them a multi-step KYC flow. Complex stuff. Document upload, liveness check integration, status polling. I made the mistake early on of asking GPT-4 to "build the full KYC flow component." It produced 600 lines of heroic-looking garbage. Tangled logic, mixed concerns, no real separation between UI state and business logic.

So I scrapped it and started again with a chain.

First prompt: "Design the state machine for a 4-step KYC flow. Steps: identity, document upload, liveness, review. Give me the state type and transitions only, no UI."
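The shape it gave back was along these lines (reconstructed and simplified; the step names come from the prompt, the rest is illustrative):

```typescript
// The four steps, in order. A union type keeps transitions checkable.
type KycStep = "identity" | "documentUpload" | "liveness" | "review";

type KycState = {
  step: KycStep;
  status: "idle" | "submitting" | "error";
  error?: string;
};

// Legal forward transitions only; going backwards is a separate concern.
const NEXT_STEP: Record<KycStep, KycStep | null> = {
  identity: "documentUpload",
  documentUpload: "liveness",
  liveness: "review",
  review: null, // terminal step
};
```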

Second prompt: "Given this state machine [paste], write the Zustand store."
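Which produced, more or less, this. Again a sketch, assuming the KycState and NEXT_STEP definitions from the previous step and Zustand's create:

```typescript
import { create } from "zustand";

// The store wraps the state machine: state plus the only legal ways to move it.
type KycStore = KycState & {
  advance: () => void;
  fail: (error: string) => void;
  reset: () => void;
};

export const useKycStore = create<KycStore>((set) => ({
  step: "identity",
  status: "idle",
  // Move forward only along legal transitions; no-op on the terminal step.
  advance: () =>
    set((s) => {
      const next = NEXT_STEP[s.step];
      return next ? { step: next, status: "idle", error: undefined } : s;
    }),
  fail: (error) => set({ status: "error", error }),
  reset: () => set({ step: "identity", status: "idle", error: undefined }),
}));
```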

Third prompt: "Given this store [paste], write the StepIdentity component. Just this step."

The output from the chained approach was usable. Not perfect — I still rewrote about 30% — but usable. The monolithic approach gave me nothing.

Anthropic's own guidance on prompting talks about breaking complex tasks into subtasks, and honestly, this aligns exactly with what I found through trial and error. Break the problem down before you break your codebase.

---

Make It Argue With Itself

Here's one I stumbled onto completely by accident. I was reviewing a generated utility function and instead of just running it, I added a follow-up prompt: "What are the potential bugs or edge cases in the code you just wrote?"

The model found three issues it hadn't accounted for. One of them was a genuine problem — a race condition in an async loop that would've been a nightmare to debug in production.

Now I do this routinely. Write the code, then ask it to critique the code. Then ask it to fix the critique. It feels slightly absurd — asking the model to review its own work — but it consistently surfaces things I'd have caught only after a painful debugging session.

You can take this further. After getting a working function, try: "Rewrite this with a focus on performance" or "How would this behave under high concurrency?" The answers aren't always applicable, but about 40% of the time they surface something worth acting on.

---

The "Role + Constraint" Frame

There's a prompt pattern I use constantly now that I wish I'd figured out in year one. It goes: "You are a [specific type of engineer]. Your constraint is [hard rule]. Now [task]."

Example: "You are a backend engineer who cares deeply about database query efficiency. Your constraint is that you cannot fetch more than what's needed for this render — no over-fetching. Write a Supabase query for the admin dashboard that returns order count, total revenue, and the five most recent orders."

That framing does two things. It aligns the model's "persona" with what I actually need. And the constraint acts as a guardrail — something the model explicitly checks itself against as it generates.

OpenAI's prompting best practices describe a similar idea around giving the model a persona with explicit instructions. Worth reading if you haven't, though I'd say the constraint piece is underemphasised in their docs.

Compare the output of that framed prompt to "write a Supabase query for the admin dashboard." Night and day. Genuinely.

---

When to Stop Prompting and Just Write the Code

This is the part nobody wants to say out loud.

AI coding tools are brilliant at:

  • Boilerplate
  • CRUD operations
  • Utility functions
  • Writing tests for code you've already written
  • Translating between formats (JSON schema to TypeScript type, SQL to Supabase query, etc.)
  • First drafts of things you'll heavily modify

They are genuinely poor at:

  • Understanding your app's real architecture
  • Knowing which trade-off matters for your specific scale
  • Writing anything that touches a tricky stateful interaction without heavy guidance
  • Anything where the spec is fundamentally ambiguous

I have a personal rule now: if I've sent more than four follow-up prompts trying to get a piece of code right, I close the chat and write it myself. The time cost of prompt debugging can exceed the time cost of just writing it, especially for anything under about 50 lines.

The Stack Overflow Developer Survey 2024 found that 76% of developers are using or planning to use AI tools — but the same data showed relatively low trust in accuracy. That gap between usage and trust is exactly where good prompt engineering lives.

---

Version Your Prompts Like You Version Your Code

Last year I started keeping a prompts/ folder in projects where AI assistance is significant. Markdown files. One per major feature area. When a prompt produces particularly good output, I save it. When I find a better version, I update the file.

Sounds obsessive. It's saved me probably six hours on the last big project alone — a headless WooCommerce build for a retailer moving from Shopify. I reused a product query prompt (with minor edits) across four different components instead of re-engineering the context from scratch each time.

Git-track it. Seriously. Prompt quality is reproducible if you treat prompts as artefacts rather than throwaway inputs. LangChain's prompt templates formalise this idea in a framework context, but you don't need any framework — a folder of Markdown files is enough for most agency workflows.
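The files themselves don't need to be clever. An invented illustration (not the actual WooCommerce prompt; the structure is just what works for me):

```markdown
# prompts/product-query.md

## Context
Stack: Next.js 14 (App Router), TypeScript, Supabase, Tailwind 3.4.
Error shape: { success: boolean, error?: string, data?: unknown }.

## Prompt (v3, last good version)
Write a TypeScript async function that fetches published products by
category slug, returns Product[], and throws a typed error on failure.
No client components.

## Notes
v2 kept inventing a Redux store; adding "no Redux anywhere" fixed it.
```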

---

FAQ

Is prompt engineering actually a transferable skill or just model-specific?

Mostly transferable. The core principles — specificity, context-setting, chaining, critique loops — apply across GPT-4, Claude 3.5 Sonnet, Gemini, whatever comes next. The syntax varies a bit and some models respond better to certain framing, but the underlying logic holds. I've found Claude tends to respond well to explicit constraints; GPT-4 responds well to examples. Minor differences, same fundamentals.

How do you handle AI-generated code in code review?

Same as any other code. If it's going into production, it gets reviewed. Full stop. I've stopped flagging "this was AI-generated" in PRs at Seahawk because it became a red herring — reviewers would scrutinise it differently, sometimes unfairly, sometimes not enough. The code stands or falls on its own merits. What I do flag: any section where the logic is non-obvious and I haven't added inline comments explaining the reasoning.

Do you use system prompts or just chat prompts?

Both. In Cursor I lean on the .cursorrules file for persistent project-level instructions — things I'd otherwise paste every time. For one-off tasks in the ChatGPT or Claude web UI, it's all in the chat. The .cursorrules approach has meaningfully reduced repetition and the model stays more consistent across a long session.
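For reference, a .cursorrules file is just plain text in the project root. A trimmed illustration along the lines of my context block above, not a template to copy:

```
You are working in a Next.js 14 App Router project with TypeScript,
Supabase, Zustand, and Tailwind CSS 3.4. All API responses use the
shape { success: boolean, error?: string, data?: unknown }. Prefer
server components; no client components unless user interaction
requires it. Never introduce Redux or custom CSS files.
```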

What's your honest take on AI replacing developers?

It's replacing certain tasks, not developers. The judgment calls — what to build, how to architect it, which trade-off fits this client's actual situation — those are nowhere near automated. If anything, the developers getting squeezed are the ones who were purely execution-oriented, not design or architecture-oriented. Sharpen the judgment layer. That's the defensible bit.

---

Nine years of building things on the web, and the fundamental lesson keeps repeating: the quality of your output is determined by the quality of your inputs. That was true when it was client briefs. It's true now with prompts.

The model is a fast, occasionally brilliant, frequently overconfident junior developer. Manage it accordingly.
