From Idea to Testable Output: The Engineer's Guide to AI-Driven Development

Jake Ruesink
AI
28 May, 2026

How to set up your project so an AI agent can move fast — and you can trust what it ships.

The Wrestling Match

A developer I’ve been coaching sent me a message recently that I think a lot of engineers will recognize:

Sometimes I don’t quite get the result I wanted from the prompt — I get changes that feel ‘close enough.’ So I start asking it to make adjustments, and before I know it, I’m asking for far more changes than I expected. At what point would you say it’s time to throw it away and focus on fixing the prompt instead? It starts to feel like I’m wrestling with the AI instead of guiding it.

That feeling — wrestling instead of guiding — is the central experience of learning AI-driven development. And the way out of it isn’t a better prompt. It’s a better project.

Here was my answer: most people expect the AI to fully understand what they’re imagining, read their plan perfectly, and execute it against the tech stack with all the right best practices. The models aren’t good enough to do that yet — to go from even a great plan straight to flawless execution. There’s a messy middle.

The move that closes the gap is this: as you feel the pain, form the repo around the LLM. Every time the agent gets it wrong, you’ve learned a shortcoming. Encode it — into a rule, a doc, a test, a better feedback loop — so the next session operates the way you’d expect. The more you iterate like that, the more the wrestling disappears.

The realization that clicked for this developer is one worth stating plainly: rules and docs aren’t just for coding standards. They’re for making sure the plan actually executes in a way that works.

The Core Question

So stop asking “what can AI do?” and start asking: “How do I set up my project so I can go as quickly as possible from idea to testable output?”

That’s the whole game. Every technique in this guide serves that one question. The answer isn’t about which model to use — it’s about how you structure your codebase, your rules, your tests, and your environments so an AI agent can operate with speed and you can verify what it produces without burning hours on review.

AI doesn’t replace engineering judgment. It amplifies whatever your codebase already is. Good codebases get better with AI. Bad codebases get worse, faster. This guide is about making your codebase the kind that AI accelerates.

When do you throw it away? Back to the original question. My rule of thumb: if you’re more than two or three “close enough” rounds deep and still steering, stop. Throw away the diff — it’s cheap. The signal isn’t “the code is wrong,” it’s “my prompt or my project didn’t carry enough intent.” Fix that — the prompt, or better, a rule or doc — and rerun from a clean state. It’s all throwaway code. The iteration is where you learn.

Part 1: Think About Your Codebase Like an Agent Would

Before writing a single rule or prompt, put yourself in the agent’s shoes. It has your repo cloned. It can read files, run commands, and write code. But it doesn’t have your years of context. It doesn’t know why things are the way they are. It just sees what’s there.

Make the implicit explicit

The agents that produce the best work operate in codebases where decisions are documented, patterns are consistent, and conventions are visible. In practice:

Consistent naming conventions. If you name components PascalCase, do it everywhere. If API routes follow a pattern, follow it everywhere. The agent reads patterns to infer rules — inconsistent naming means inconsistent output.
Clear file structure. When the project has a predictable layout (components here, services there, tests alongside source), the agent can navigate without guessing. Add a brief ARCHITECTURE.md or CONVENTIONS.md at the root.
Ubiquitous language. Borrow from DDD — maintain a shared vocabulary for domain concepts. If the team calls it “cart” and the codebase has Basket, OrderDraft, and ShoppingCart used interchangeably, the agent will be confused. Pick one term and use it everywhere.
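
The vocabulary point can be enforced mechanically. What follows is a minimal sketch, not an existing tool: the glossary contents, the choice of Cart as the preferred term, and the findVocabularyDrift helper are all assumptions made up for illustration.

```typescript
// Hypothetical glossary: each preferred term maps to the synonyms to avoid.
const glossary: Record<string, string[]> = {
  Cart: ["Basket", "OrderDraft", "ShoppingCart"],
};

interface Drift {
  line: number; // 1-based line number in the scanned text
  found: string; // the discouraged synonym that appeared
  preferred: string; // the term the team agreed on
}

// Scan source text line by line and report every discouraged synonym.
function findVocabularyDrift(
  source: string,
  terms: Record<string, string[]>,
): Drift[] {
  const drifts: Drift[] = [];
  source.split("\n").forEach((text, index) => {
    for (const [preferred, synonyms] of Object.entries(terms)) {
      for (const synonym of synonyms) {
        if (text.includes(synonym)) {
          drifts.push({ line: index + 1, found: synonym, preferred });
        }
      }
    }
  });
  return drifts;
}
```

Run something like this in CI or a pre-commit hook and the agent gets the same correction a reviewer would give, only sooner.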

Deep modules, simple interfaces

This comes from John Ousterhout’s A Philosophy of Software Design, and it matters more with AI than without. Design your codebase so it has:

Few, large modules with simple interfaces
Complexity hidden inside — the agent sees a clean surface, not implementation details

Shallow modules with many small interfaces confuse agents. Deep modules with clear contracts let the agent work confidently.
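
Here is a rough sketch of what “deep” looks like in practice. Every name and business rule in it (CartLine, priceCart, the 10% bulk discount, the 8% tax rate) is invented for the example. The point is only the shape: one public function, with the messy parts tucked behind it.

```typescript
interface CartLine {
  unitPriceCents: number;
  quantity: number;
}

interface PricedCart {
  subtotalCents: number;
  discountCents: number;
  taxCents: number;
  totalCents: number;
}

// Internal helpers. In a real module these would not be exported.
function subtotal(lines: CartLine[]): number {
  return lines.reduce((sum, l) => sum + l.unitPriceCents * l.quantity, 0);
}

function bulkDiscount(subtotalCents: number): number {
  // Assumed rule for illustration: 10% off orders of 100.00 or more.
  return subtotalCents >= 10000 ? Math.round(subtotalCents * 0.1) : 0;
}

function salesTax(taxableCents: number): number {
  // Assumed flat 8% rate for illustration.
  return Math.round(taxableCents * 0.08);
}

// The public surface: the only function a caller (or an agent) needs to know.
function priceCart(lines: CartLine[]): PricedCart {
  const subtotalCents = subtotal(lines);
  const discountCents = bulkDiscount(subtotalCents);
  const taxCents = salesTax(subtotalCents - discountCents);
  return {
    subtotalCents,
    discountCents,
    taxCents,
    totalCents: subtotalCents - discountCents + taxCents,
  };
}
```

An agent adding a checkout feature only has to learn priceCart and its two types. It never needs to rediscover how discounts and tax interact.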

Eliminate dead patterns

If there are patterns in the codebase you don’t want replicated, remove them or add explicit rules against them. “Do as I say, not as I do” doesn’t work with AI, because it reads the codebase as truth. As my colleague Derek puts it: “The sloppier your code is, the worse the code AI is going to produce.” He takes the inverse seriously too: in his Medusa projects he copies the entire Medusa codebase into the repo and gitignores it, so the agent can read it directly while it plans. Give the agent the right context and the right materials, and it stops guessing.

Part 2: Set Up Rules and Docs That Actually Help

Every AI coding tool — Cursor, Claude Code, Codex, Windsurf, Copilot — reads some form of project instructions. Cursor uses .cursor/rules/, Claude Code uses CLAUDE.md, Codex uses AGENTS.md or codex.md. The specifics vary, but the principles are the same.

The rule-writing rubric

Good rules share these properties:

1. One rule, one concern. If a rule covers two things, split it. A rule that says “use TypeScript strict mode AND prefer functional components” will be half-ignored. Two rules, each specific.

2. State the why, not just the what. Rules without rationale become cargo cult. “Don’t use useEffect for data fetching” → the agent follows it mechanically. “Don’t use useEffect for data fetching — use your router’s loader instead, because useEffect causes waterfalls and stale data” → the agent can reason about edge cases.

3. Earned, not preemptive. A rule should exist because a failure mode was observed, not imagined. Came up once? It’s a note. Three times? It’s a rule candidate. Across projects? It’s an archetype.

4. Keep it short. Target 50–150 lines per rule. If you need more context, summarize the principle and link to deeper docs. Long rules get skimmed — by humans and agents alike.

5. Include examples. One DO, one DON’T, from real code when possible. Examples teach faster than prose.

Standard rule structure

Every rule should have:

Title:    What it enforces, in plain language
Why:      The failure mode this prevents (1–2 sentences)
Rule:     The heuristic itself, concise and direct
Examples: One DO, one DON'T
Scope:    File patterns or conditions for when this fires
See also: Links to deeper docs or related rules
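
If you keep rules as structured data, the rubric above can be checked automatically. This is a sketch under assumptions: the Rule shape simply mirrors the fields listed above, and lineCount, timesObserved, and lintRule are hypothetical names, not part of Cursor, Claude Code, or Codex.

```typescript
// Hypothetical shape mirroring the standard rule structure.
interface Rule {
  title: string;
  why: string;
  rule: string;
  examples: { do: string; dont: string };
  scope: string[]; // file patterns; empty for always-on rules
  seeAlso: string[];
  lineCount: number; // length of the rule file in lines
  timesObserved: number; // how often the failure mode has actually come up
}

// Return a list of rubric violations; an empty list means the rule passes.
function lintRule(rule: Rule): string[] {
  const problems: string[] = [];
  if (rule.why.trim() === "") {
    problems.push("state the why, not just the what");
  }
  if (rule.examples.do.trim() === "" || rule.examples.dont.trim() === "") {
    problems.push("include one DO and one DON'T example");
  }
  if (rule.lineCount > 150) {
    problems.push("longer than 150 lines: summarize and link to deeper docs");
  }
  if (rule.timesObserved < 3) {
    problems.push("seen fewer than three times: keep it as a note for now");
  }
  return problems;
}
```

The “one rule, one concern” check is left out on purpose. That one needs a human, or an agent, reading the rule.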

Where to put them

Always-on rules (the “constitution”) — foundational, universal, lean. These load every session.
Context-matched rules — loaded when the agent touches specific files or directories. Most rules belong here.
On-demand rules — deep knowledge, pulled when the agent encounters a specific problem. Link to these from lighter rules.

The `.cursor/rules/` pattern (Cursor)

Cursor supports multiple rule files that can be set to always apply, match glob patterns, or be manually referenced:

.cursor/rules/
  always.mdc           # Always loaded — coding standards, project conventions
  react-components.mdc # Loaded when touching component files
  api-routes.mdc       # Loaded when touching API routes
  testing.mdc          # Loaded when touching test files
  database.mdc         # Loaded when touching DB schema/queries

Each .mdc file has a frontmatter block:

---
description: Rules for React component patterns
globs: ["src/components/**/*.{tsx,jsx}"]
alwaysApply: false
---

The `CLAUDE.md` pattern (Claude Code)

Claude Code reads CLAUDE.md files hierarchically — root, then per-directory. Put project-wide guidance at the root and module-specific guidance near the code it governs.

The `AGENTS.md` pattern (Codex / general)

AGENTS.md is emerging as a cross-tool standard. Start with this: a growing number of tools read it, and you can always add tool-specific files alongside.

Anti-patterns to avoid

Rules that repeat what’s in the code. If the types already say it, don’t write a rule about it.
Rules that are really documentation. “How the auth system works” belongs in docs, not rules. Rules should steer behavior.
Rules that contradict each other. Audit periodically. Conflicting rules are worse than no rules.
Rules that never trigger. If a glob matches zero files, the rule is dead weight.

The periodic prune

Set a cadence (monthly, or after major refactors) to audit your rules:

Remove rules whose referenced patterns no longer exist in the codebase
Merge overlapping rules
Split rules that have grown too broad
Promote notes that have become real patterns into rules
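
Finding rules that never trigger is the easiest part of the prune to automate. The sketch below assumes you have already parsed each rule’s frontmatter into a RuleFile object and have a list of repo file paths. The globToRegExp helper is a deliberately tiny stand-in that only understands `**`, `*`, and `{a,b}`; in a real script you would reach for a proper glob library.

```typescript
interface RuleFile {
  name: string;
  globs: string[];
  alwaysApply: boolean;
}

// Convert a simple glob to a RegExp. Supports only `**`, `*`, and `{a,b}`.
function globToRegExp(glob: string): RegExp {
  let pattern = "";
  let braceDepth = 0;
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*" && glob.charAt(i + 1) === "*") {
      if (glob.charAt(i + 2) === "/") {
        pattern += "(?:.*/)?"; // `**/` spans zero or more directories
        i += 2;
      } else {
        pattern += ".*"; // a trailing `**` matches anything
        i += 1;
      }
    } else if (ch === "*") {
      pattern += "[^/]*"; // `*` stays within one path segment
    } else if (ch === "{") {
      braceDepth++;
      pattern += "(?:";
    } else if (ch === "}" && braceDepth > 0) {
      braceDepth--;
      pattern += ")";
    } else if (ch === "," && braceDepth > 0) {
      pattern += "|"; // alternatives inside braces
    } else if ("\\^$+?.()|[]{}".includes(ch)) {
      pattern += "\\" + ch; // escape regex metacharacters
    } else {
      pattern += ch;
    }
  }
  return new RegExp("^" + pattern + "$");
}

// Names of glob-scoped rules whose globs match no file in the repo.
function findDeadRules(rules: RuleFile[], repoFiles: string[]): string[] {
  return rules
    .filter((rule) => !rule.alwaysApply && rule.globs.length > 0)
    .filter((rule) => {
      const matchers = rule.globs.map((glob) => globToRegExp(glob));
      return !repoFiles.some((file) => matchers.some((m) => m.test(file)));
    })
    .map((rule) => rule.name);
}
```

Always-on rules and manually referenced rules (no globs) are skipped, since matching zero files says nothing about whether they are still useful.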

Part 3: Learn From the People Doing This Well

There’s a growing community of engineers sharing practical AI-driven development workflows. They’re worth your time because they’re grounded in real projects, not theory.

Matt Pocock

Why watch: Matt’s “AI Coding For Real Engineers” workshop is the best practical guide available. He’s a TypeScript expert who treats AI as a force multiplier on solid fundamentals, not a replacement for them.

Key ideas to absorb:

The grill-me pattern. Before writing code, have the agent ask you 40–100 questions until it shares your design concept. This closes the “AI didn’t do what I wanted” gap.
Plan → PRD → Slice → Ship. Turn the grilling session into a PRD, slice it into thin vertical “tracer bullets,” and ship each one independently.
Vertical slicing beats horizontal. One feature end-to-end (UI + API + DB) beats one layer across all features. Each slice gives the agent fast feedback.
Software fundamentals matter more, not less. Deep modules, ubiquitous language, feedback loops — these pay double in the AI age.

Watch: AI Coding For Real Engineers (96 min)

Lee Robinson

Why follow: Lee spent years leading developer experience at Vercel, then joined Cursor in 2025 to lead developer education. That makes him one of the clearest voices on how to actually drive an AI editor day to day — agent modes, background/cloud agents, and the workflow habits that separate fast practitioners from frustrated ones. Start with his writing and talks at leerob.com and Cursor’s own Learn material.

The AI Engineer community

Beyond individual creators, keep an eye on:

Anthropic’s engineering blog — especially “Building Effective Agents” and the Claude Code design posts. They publish the thinking behind the tools.
Andrej Karpathy’s interviews — his talks on code agents and the “loopy era” of AI. Key insight: the bottleneck is how often you have to reach back in, not raw model capability.
Hamel Husain & Shreya Shankar — their evals course is the best practical guide on measuring AI quality. “Evals are the new PRD.”

The common thread: none of these people treat AI as magic. They treat it as a tool that requires deliberate setup, clear constraints, and fast verification loops.

Part 4: Use Skills to Interview You and Set Up Project Prompts

One of the highest-leverage moves is using the AI before you start coding to establish a shared understanding of the project.

The project onboarding interview

Before writing any feature code, run an interview session:

"I'm going to describe a project to you. Before writing any code,
I want you to interview me about it. Ask me about:

1. The business domain and key concepts
2. The technical architecture and why those decisions were made
3. The coding conventions the team follows
4. The patterns that have caused problems before
5. What a successful first feature looks like

Keep asking questions until you can explain the project back to me
in a way I'd agree with."

This does two things:

It surfaces gaps in your own thinking (you’ll realize you haven’t decided things).
It builds context for the agent that’s grounded in your decisions, not generic defaults.

Turn the interview into project prompts

Take the interview output and bake it into your project configuration:

ARCHITECTURE.md — the mental model of the system
CONVENTIONS.md — coding standards and patterns
Rule files — specific behavioral guidance derived from what you discussed
Glossary — the ubiquitous language for your domain

The grill-me pattern (Matt Pocock)

For individual features, take it further:

"I want to build [feature]. Before you write any code, grill me on
the design. Ask me about edge cases, user flows, error handling,
performance requirements, and how it interacts with existing features.
Keep going until you understand it well enough to write a spec."

Then have the agent produce a PRD from that conversation. Then slice the PRD into issues the agent can pick up independently. This planning investment pays for itself in the first feature.

A challenge to try this week

Here’s a workflow we hand to engineers leveling up into AI-driven development. Run it a few times a week, at the end of your day:

Pick a big goal. Something ambitious — a feature or architecture improvement that feels like a reach. Aim bigger than you normally would.
Get interviewed. Open your agent (Cursor’s Composer, Codex, Claude Code) and say: “Share your leaning and interview me with multiple-choice questions to fill in the gaps for how we can meet this goal.” Go back and forth until the plan feels solid.
Generate the handoff. When the interview converges: “Based on this, write a full prompt with context — a process for how we can meet this goal.” That’s your lightweight PRD.
Hand off to a cloud agent. Start a fresh session (don’t reuse the interview context). Paste the PRD. Define success crisply: “I want a PR with tests that pass that accomplishes this goal.” Then hit go and log off.
Check in and iterate. Come back later and review. If it worked, great. If not, adjust the prompt — or the rules and docs — and rerun. It’s all throwaway code; the iteration is where you build intuition.

You’ll know it’s working when you’re attempting bigger features than you’d tackle alone, your prompts get sharper each round, and planning becomes the creative act while execution becomes the AI’s job.

The “explain it back” check

After any setup session, ask: “Explain our project conventions back to me.” If the agent’s explanation doesn’t match your intent, the rules need fixing — not the agent.

Part 5: Build a Feedback Loop With Tests

The rate of feedback is your speed limit. This was true before AI; it’s doubly true with AI. If your tests are slow, brittle, or nonexistent, the agent is flying blind.

Why tests matter more with AI

When you write code manually, you have a mental model of what it should do. You catch obvious mistakes before running tests. The agent doesn’t have that model — it runs on patterns and rules. Tests are how it knows if what it wrote works.

The principle: automation only saves time if verification is also fast. If the agent spends 2 minutes writing code and you spend 20 minutes verifying it, you’ve saved nothing.

Types of feedback loops

1. Static analysis (instant)

TypeScript strict mode
ESLint with project-specific rules
Prettier for formatting consistency
Set these up to run on save or pre-commit

2. Unit tests (seconds)

Write tests for business logic — the agent can run these after every change
Prefer tests that are fast, deterministic, and isolated
Mock external dependencies so tests don’t flake

3. Integration tests (seconds to minutes)

Test key flows end-to-end through your API
Use seeded test data (more on this in the preview environments section)
These catch the “wired it wrong” class of bugs

4. Visual / browser tests (minutes)

Screenshot comparison for UI changes
Only worth it for high-traffic surfaces
The agent can generate these, but they need human review

The TDD loop with AI

Matt Pocock’s workshop demonstrates this well:

Write the test first. Describe the expected behavior.
Have the agent implement. It runs the test, sees it fail, writes code to make it pass.
Verify. Run the test suite. If it passes, the agent’s output is verified at the unit level.
Manual QA for taste. Tests verify correctness. Humans verify “is this actually what we want?”

This last step is critical. Tests can tell you the code works. Only you can tell if it’s right for the product.
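
To make the loop concrete, here is a small sketch of the first three steps. The Cart types and addItem are hypothetical, and runTest and expectEqual are tiny stand-ins for a real test runner so the example stays self-contained. In your project you would use whatever runner you already have.

```typescript
interface CartItem {
  sku: string;
  priceCents: number;
  quantity: number;
}

interface Cart {
  items: CartItem[];
  totalCents: number;
}

// Minimal stand-ins for a test runner's assertion and reporting.
function expectEqual<T>(actual: T, expected: T): void {
  if (actual !== expected) {
    throw new Error(`expected ${String(expected)}, got ${String(actual)}`);
  }
}

function runTest(name: string, body: () => void): string {
  try {
    body();
    return `PASS ${name}`;
  } catch (error) {
    return `FAIL ${name}: ${(error as Error).message}`;
  }
}

// Step 1: the test, written first. It describes the expected behavior.
function addsItemAndUpdatesTotal(): void {
  const empty: Cart = { items: [], totalCents: 0 };
  const cart = addItem(empty, { sku: "mug", priceCents: 1200, quantity: 2 });
  expectEqual(cart.items.length, 1);
  expectEqual(cart.totalCents, 2400);
}

// Step 2: the implementation the agent writes to make the test pass.
function addItem(cart: Cart, item: CartItem): Cart {
  const items = [...cart.items, item];
  const totalCents = items.reduce(
    (sum, i) => sum + i.priceCents * i.quantity,
    0,
  );
  return { items, totalCents };
}

// Step 3: verify by running the test.
const outcome = runTest(
  "adds item to cart and updates total",
  addsItemAndUpdatesTotal,
);
```

The test name doubles as the spec: the agent reads “adds item to cart and updates total” and knows what done means before it writes a line.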

Making tests agent-friendly

Co-locate tests with source. Component.tsx → Component.test.tsx in the same directory. The agent doesn’t have to hunt for them.
Use descriptive test names. "adds item to cart and updates total" not "test1". The agent reads these to understand expected behavior.
Keep the test runner fast. If tests take 10 minutes, the agent can’t iterate. Under 30 seconds is the target.
Seed data, not fixtures. A script that creates realistic test data is more maintainable than hardcoded JSON files that rot.
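
For the last point, one common shape is a small factory with overridable defaults. This is a sketch with made-up names (User, createUserFactory, the example.test email domain), not a prescribed pattern.

```typescript
interface User {
  id: number;
  email: string;
  role: "admin" | "customer";
}

// Each factory has its own counter, so every test file starts from id 1
// and the generated data is deterministic.
function createUserFactory(): (overrides?: Partial<User>) => User {
  let nextId = 1;
  return function buildUser(overrides: Partial<User> = {}): User {
    const id = nextId++;
    return {
      id,
      email: `user${id}@example.test`,
      role: "customer",
      ...overrides, // a test states only the fields it cares about
    };
  };
}
```

When the schema gains a field, you update the factory once instead of hunting through JSON files, and each test still shows only the fields that matter to it.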

Bonus 1: Preview Environments With Good Seeded Data

Every PR should be testable in isolation. Not “I’ll pull it down and check locally” — actually testable, with a URL you can open and click through.

Why this matters

Preview environments close the verification gap. Instead of reading code to decide if it’s correct, you use the feature. This is the highest-bandwidth feedback loop available.

Setting it up

The basic shape:

Each PR gets its own deployment. Vercel, Netlify, Railway, Cloudflare Pages — most platforms support this out of the box.
Each preview gets its own database. This is the part most teams skip, and it’s the most important. Use a seed script that creates:
- Realistic users with different roles/permissions
- Sample products/content matching your actual domain
- Orders/transactions in various states (pending, completed, failed)
- Edge cases already in the data (empty states, long text, special characters)
The seed script is part of the project. Not a one-time setup — it’s maintained alongside the schema. When the schema changes, the seed data changes with it.

The seed script pattern:

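// scripts/seed-preview.ts
// Runs automatically when a preview environment spins up.
// createUser, createProducts, and createOrder are project-specific helpers
// (not shown here); import them from wherever yours live.

// Shared throwaway password so the credentials printed below actually work.
// Assumes createUser accepts a password field.
const PREVIEW_PASSWORD = "password123";

async function seed() {
  // Create one user per role
  const admin = await createUser({
    role: "admin",
    email: "admin-preview@test.com",
    password: PREVIEW_PASSWORD,
  });
  const customer = await createUser({
    role: "customer",
    email: "customer-preview@test.com",
    password: PREVIEW_PASSWORD,
  });

  // Create realistic catalog data (at least five products, so every
  // order below gets items)
  const products = await createProducts(/** realistic product data */);

  // Create orders in various states
  await createOrder({ status: "pending", items: products.slice(0, 2) });
  await createOrder({ status: "completed", items: products.slice(2, 4) });
  await createOrder({ status: "failed", items: products.slice(4) });

  // Print test accounts
  console.log("Preview seeded:");
  console.log(`  Admin:    admin-preview@test.com / ${PREVIEW_PASSWORD}`);
  console.log(`  Customer: customer-preview@test.com / ${PREVIEW_PASSWORD}`);
}

// Run the seed and fail the deploy step loudly if anything goes wrong.
seed().catch((error) => {
  console.error("Preview seed failed:", error);
  process.exit(1);
});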

What this unlocks:

Send a preview URL to a stakeholder and they can actually use it
The agent can run E2E tests against realistic data
You can verify edge cases without manual setup
PR reviews become “click the link” instead of “read the diff”

Cost management

Preview environments can get expensive. Mitigate with:

Auto-destroy previews when the PR closes
For lower-stakes environments, one shared database with namespaced data instead of a database per preview
Concurrency limits — only keep N previews alive at a time
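
The first and last items come down to one small decision function. A sketch, assuming you can list your previews with a PR state and a last-activity timestamp; the Preview shape and selectPreviewsToDestroy are hypothetical, and the actual teardown call depends on your hosting platform.

```typescript
interface Preview {
  prNumber: number;
  prOpen: boolean;
  lastActiveAt: number; // epoch milliseconds of the last deploy or visit
}

// Decide which previews to tear down: every preview whose PR is closed,
// plus the least recently active open ones beyond the maxAlive limit.
function selectPreviewsToDestroy(
  previews: Preview[],
  maxAlive: number,
): number[] {
  const closed = previews.filter((p) => !p.prOpen);
  const openNewestFirst = previews
    .filter((p) => p.prOpen)
    .sort((a, b) => b.lastActiveAt - a.lastActiveAt);
  const overflow = openNewestFirst.slice(maxAlive);
  return [...closed, ...overflow].map((p) => p.prNumber);
}
```

Run it on a schedule or on every PR event, then hand the returned PR numbers to whatever command your platform uses to delete a deployment and its database.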

Bonus 2: Remote Development Environments

The goal: send tasks to an agent and keep things going when you’re away from your desk. This requires an environment the agent can access and work in independently.

Option A: Your own machine

The setup:

An always-on machine (Mac Mini, homelab server, etc.) with your dev environment
SSH access (preferably via Tailscale or similar for security)
Docker for isolated workspaces
The agent runs in a Docker container with the repo mounted

The workflow:

You describe a task (from your phone, laptop, whatever)
The agent spins up a workspace in Docker
It works through the task — write code, run tests, iterate
It opens a PR when done
You review when you’re back at your desk

Key considerations:

Resource isolation. Docker lets you cap CPU and memory per container, so the agent can’t hog your machine’s resources.
Network access. Limit what the container can reach: it needs package registries and your APIs, not everything.
Cleanup. Auto-destroy workspaces after inactivity to reclaim disk.

Option B: Cloud-based agents (Cursor, Codex, etc.)

Cursor background agents:

Run tasks in Cursor’s cloud environment
Limited to Cursor’s ecosystem but zero infra management
Best for: Cursor users who want to send tasks from mobile

Codex (OpenAI):

Cloud-based agent that works in a sandboxed environment
Good for autonomous tasks — give it a prompt and a repo, let it run
Best for: well-specified tasks where you trust the output enough to review later

The general pattern regardless of platform:

Write a clear task description. The agent needs context, constraints, and a definition of done.
Specify the verification step. “Run the test suite and ensure all tests pass” or “Open the preview URL and verify the form submits.”
Set boundaries. “Only modify files in src/features/auth/” or “Don’t touch the database schema.”
Review the output. Always. Cloud agents are powerful, but you’re still the engineer.

One setup note worth repeating: the agent needs to be able to run code, execute tests, and create commits. If it only has read access to your repo, it can’t be truly agentic. Set up the cloud environment first.
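
The general pattern above fits naturally into a small template, so no handoff goes out missing a boundary or a verification step. This is a sketch with invented names (AgentTask, missingFields, renderTaskPrompt); no agent platform requires this exact format.

```typescript
interface AgentTask {
  goal: string;
  context: string[];
  boundaries: string[];
  verification: string[];
  definitionOfDone: string;
}

// List the parts of the brief that are still empty.
function missingFields(task: AgentTask): string[] {
  const missing: string[] = [];
  if (task.goal.trim() === "") missing.push("goal");
  if (task.context.length === 0) missing.push("context");
  if (task.boundaries.length === 0) missing.push("boundaries");
  if (task.verification.length === 0) missing.push("verification");
  if (task.definitionOfDone.trim() === "") missing.push("definitionOfDone");
  return missing;
}

// Render the brief as plain text to paste into a fresh agent session.
function renderTaskPrompt(task: AgentTask): string {
  const section = (title: string, items: string[]): string =>
    [`${title}:`, ...items.map((item) => `- ${item}`)].join("\n");
  return [
    `Goal: ${task.goal}`,
    section("Context", task.context),
    section("Boundaries", task.boundaries),
    section("Verification", task.verification),
    `Definition of done: ${task.definitionOfDone}`,
  ].join("\n\n");
}
```

A brief with a boundary such as “Only modify files in src/features/auth/” and a verification step such as “Run the test suite and ensure all tests pass” renders into five short sections. If missingFields returns anything, the task isn’t ready to hand off.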

Option C: Hybrid (recommended)

The most resilient setup combines both:

Local machine for sensitive repos, complex environments, or when you need full control
Cloud agents for well-scoped tasks, parallel work, or when you’re away from your machine
Preview environments as the bridge — both local and cloud agents push to PRs, and you verify through the same preview URL regardless of where the work happened

The Checklist

Before starting a new project — or leveling up an existing one — run through this:

Architecture documented. Can a new developer (or agent) understand the system from reading the docs?
Ubiquitous language established. Is domain vocabulary consistent across code and docs?
Rules configured. Does the project have behavioral rules that are short, specific, and earned?
Tests are fast. Can the test suite run in under 30 seconds?
Tests are co-located. Can the agent find the test for any file immediately?
Seed data exists. Can a preview environment spin up with realistic, usable data?
Preview environments work. Can you click a link on any PR and test the feature?
Remote environment available. Can you send tasks to an agent when you’re away from your desk?
Rules audited recently. Have you pruned stale, conflicting, or unnecessary rules in the last month?

The Mindset Shift

Moving to AI-driven engineering isn’t about replacing your skills — it’s about investing in the substrate that lets AI amplify them. The engineers who get the most out of AI aren’t the ones with the best prompts. They’re the ones with the best projects.

The project setup is the prompt. Your tests are the feedback. Your rules are the guardrails. Get those right, and the model becomes a force multiplier. Get them wrong, and it becomes a force multiplier for chaos. (For more on this stewardship mindset, see The Top Skill Engineers Should Be Developing Right Now.)

So back to where we started — the wrestling match. You don’t win it by gripping the prompt harder. You win it by building a project that carries your intent, so the next time you hand off a goal, the agent already knows how you’d expect it to operate.

Start with one thing. Make your rules better. Or add seed data to your previews. Or run a grill-me session on your next feature. The ROI is immediate and compounding.

Thanks to Matt Pocock, Lee Robinson, Andrej Karpathy, Hamel Husain, and Shreya Shankar for the thinking that informed this guide. Their talks and workshops are linked throughout.