Picture this: you queue up six user stories at 11 PM, close the laptop, and check the repo at 7 AM. Four stories pass. Two need refinement. The commit history is clean. No one prompted anything overnight.

Joe Fontaine, Udacity’s Principal Product Lead of AI Education, and AI Curriculum Lead Val Scarlata recently ran a live build session demonstrating exactly this workflow. The patterns in this article come directly from that session, the accompanying GitHub repo, and the Ralph Loops field guide we published alongside it.

That is what an autonomous AI coding agent actually does. Not autocomplete. Not a chatbot with a longer context window. A looped system that takes scoped stories, writes code, runs tests, checks pass/fail, commits progress, and keeps going without active human prompting.

Most people hear “autonomous AI coding agent” and picture either science fiction or a slightly fancier coding assistant. The reality is more mechanical and more useful. It is a workflow where the agent operates independently because the spec, the tests, and the environment give it everything it needs to make progress.

This article is not about model selection. It is about writing specs and acceptance criteria that an “amnesiac” agent can execute. Machine-checkable acceptance criteria are the unlock. The model is just the engine. The spec is the steering.

The shift from prompt-by-prompt experimentation to production-oriented autonomous workflows is where the leverage lives. Here is how to build one.

The interactive loop is a ceiling

If AI coding tools are so capable, why do they still need you hovering over every step?

The standard interactive workflow looks like this: you prompt, the model writes code, you inspect, you correct, it tries again. You are the continuity layer. You hold the context, remember the requirements, and catch the drift. The model does not.

Why chat-based coding feels fast, but stalls out

Autocomplete and chat-based coding assistance are genuinely useful for local tasks. Generating a utility function, writing a test, scaffolding a component. These are real productivity gains.

They break down on multi-step implementation. Persistent architecture decisions drift. Long-running feature work accumulates confusion. Every time you close the chat or switch context, you lose shared state and have to re-establish it. The tool is fast in bursts but stalls across anything requiring sustained coherence.

Context rot is the hidden tax

Joe uses the term context rot to describe what happens as an interactive session stretches longer. Earlier decisions blur. The model’s working memory becomes less reliable. Outputs start contradicting choices made 20 prompts ago.

The practical pain is familiar to anyone who has used these tools on real work: re-explaining requirements, correcting drift, losing trust in later outputs. The longer you go, the more you become the memory system. That is expensive. Not in tokens. In interruptions, cognitive load, and broken concentration.

The ceiling is human supervision

The bottleneck is not model intelligence alone. It is a workflow that requires you to validate and redirect at every turn. That does not scale past your attention span. It fragments deep work. It ties throughput to your availability.

If the workflow depends on constant prompting, it is not autonomous in any meaningful sense.

An autonomous AI coding agent removes that dependency. Not by being smarter, but by operating from explicit, testable criteria instead of conversational memory.

How Ralph works: the anatomy of the loop

Ralph is a practical working example of an autonomous AI coding agent. It picks up scoped work, implements it, tests it, commits it, and continues iterating. The loop, not the model alone, is what creates autonomy.

The 30-second mental model

Here is how Ralph works in plain terms:

  1. Ralph reads a PRD (product requirements document) containing user stories with pass/fail conditions.
  2. It picks the next unpassed story.
  3. It uses a coding tool to implement the work.
  4. It runs the harness and checks whether the story passes.
  5. It commits progress and moves to the next story.

That is the entire loop. Each iteration has a clear state boundary and a recoverable audit trail.

Ralph as a contractor, not a chat assistant

Val frames Ralph less like a pair programmer waiting for constant feedback and more like a contractor who gets a detailed scope of work, completes it, and checks against agreed deliverables.

A contractor needs three things: clear scope, acceptance criteria, and a way to verify completion. Without those, the contractor either stops or guesses. The same is true for an autonomous AI coding agent. The “copilot” analogy implies you are always in the seat. The contractor analogy is more accurate. You define the job. The agent executes it. You review the output.

Step-by-step anatomy of the loop

Here is the full sequence Ralph follows:

  1. Load the PRD and story list. The PRD is a structured JSON file with user stories, acceptance criteria, and pass/fail flags.
  2. Select the next unpassed story. Ralph checks which stories have not yet flipped to passes: true.
  3. Generate or modify code. The coding tool (Claude Code, Amp, or similar) writes the implementation.
  4. Run tests and validations through the harness. This includes structural tests, linters, build checks, and any custom validators.
  5. Mark pass/fail against explicit criteria. If the acceptance criteria are met, the story is marked as passing.
  6. Commit working progress to git. Each passing story creates a durable checkpoint in the repo.
  7. Repeat until stories are done or the iteration limit is reached.

Each loop iteration writes its own audit trail. If the agent fails or stalls, the last passing commit is the restore point.
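
The seven steps above can be sketched as a small driver loop. This is a minimal sketch, not the actual ralph.sh: `run_loop`, `next_unpassed`, and the three callbacks (`implement`, `run_harness`, `commit`) are hypothetical names standing in for the coding tool, the harness, and git. The `userStories` and `passes` fields mirror the PRD fields referenced elsewhere in this article, and the default of 10 iterations follows the default mentioned for the script.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def next_unpassed(prd: dict) -> Optional[dict]:
    """Return the first story whose 'passes' flag is not true, or None."""
    for story in prd["userStories"]:
        if not story.get("passes", False):
            return story
    return None


def run_loop(
    prd_path: Path,
    implement: Callable[[dict], None],    # hypothetical: invokes the coding tool
    run_harness: Callable[[dict], bool],  # hypothetical: tests, linters, build checks
    commit: Callable[[dict], None],       # hypothetical: e.g. wraps `git commit`
    max_iterations: int = 10,
) -> int:
    """Run at most max_iterations iterations and return how many were used."""
    for iteration in range(max_iterations):
        # Re-read state from disk each time: the file, not memory, is the source of truth.
        prd = json.loads(prd_path.read_text())
        story = next_unpassed(prd)
        if story is None:
            return iteration  # every story already passes
        implement(story)
        if run_harness(story):
            story["passes"] = True
            prd_path.write_text(json.dumps(prd, indent=2))
            commit(story)
        # On failure the story stays unpassed and is retried on the next iteration.
    return max_iterations
```

Because state is reloaded from the PRD file on every pass, the loop can be killed and restarted at any point without losing track of which stories remain.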

Writing a PRD that an amnesiac agent can execute

Most people assume better models create better autonomy. In practice, clearer specs create better autonomy.

Spec as the new code

Joe frames this as spec as the new code. The shift is in where human leverage sits. Less time writing every function manually. More time defining outcomes, constraints, and pass/fail conditions.

This is a real career-relevant skill shift. High-value technical work increasingly means system design, decomposition, testing, and specification. Writing a precise PRD for an autonomous AI coding agent is not product management busywork. It is the engineering.

What an amnesiac agent needs in a PRD

Assume the agent remembers nothing except what is written down and what the system can test. If success depends on unwritten intent, the workflow will break. Every story in the PRD should include:

  • User story ID for tracking and reference
  • Title and objective stating what the story accomplishes
  • Scope boundaries defining what is in and out of bounds
  • Explicit acceptance criteria written as testable conditions
  • File or system constraints specifying where changes should happen
  • Dependencies or prerequisites listing what must exist first
  • Definition of done that can be verified by the harness

Good acceptance criteria read like test assertions:

  • “When user clicks Save, preferences persist after refresh”
  • “Endpoint returns 401 for unauthenticated requests”
  • “Component renders loading state and empty state”

Each one is binary. Pass or fail. No interpretation needed.
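
To make the checklist above concrete, here is one way a story entry could be checked for completeness before the loop starts. It is a sketch under assumptions: only `id`, `title`, and `passes` appear in the commands shown in this article. The other field names (`scope`, `acceptanceCriteria`, `constraints`, `dependencies`) and the function `story_problems` are hypothetical and may not match the schema used in the repo.

```python
from typing import List

REQUIRED_FIELDS = {
    "id": str,                   # user story ID
    "title": str,                # title and objective
    "scope": str,                # hypothetical field: what is in and out of bounds
    "acceptanceCriteria": list,  # hypothetical field: testable conditions
    "constraints": list,         # hypothetical field: files or systems to touch
    "dependencies": list,        # hypothetical field: prerequisite story IDs
    "passes": bool,              # flipped to true once the harness verifies the story
}


def story_problems(story: dict) -> List[str]:
    """Return human-readable problems; an empty list means nothing is missing."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in story:
            problems.append(f"missing field: {field}")
        elif not isinstance(story[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    if story.get("acceptanceCriteria") == []:
        problems.append("acceptanceCriteria is empty: nothing to test against")
    return problems


# Illustrative story built from one of the acceptance criteria quoted above.
example_story = {
    "id": "US-001",
    "title": "Reject unauthenticated requests",
    "scope": "API auth middleware only",
    "acceptanceCriteria": ["Endpoint returns 401 for unauthenticated requests"],
    "constraints": ["src/api/"],
    "dependencies": [],
    "passes": False,
}
```

A check like this only confirms that the fields exist. Whether each criterion is actually binary still depends on how it is written.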

Story sizing is the entire art

Oversized stories create drift and compound failure. The agent loses coherence across too many changes. Undersized stories create overhead and fragmentation. The loop spends more time on context switching than implementation.

The right size for a story: one clear outcome, one bounded implementation area, one testable set of success conditions.

Bad story: “Improve onboarding UX”

Better story: “Add step indicator, client-side validation, and success redirect for the three-step signup flow”

What bad PRDs sound like

Ambiguity forces human rescue. That defeats the point of an autonomous workflow.

  • Vague: “Make the dashboard faster”
  • Better: “Lazy-load chart components so initial dashboard render completes in under 2 seconds with no data-fetch blocking”
  • Vague: “Handle errors properly”
  • Better: “Display inline validation messages for empty required fields on submit; show toast notification for server errors with retry option”

A weak spec creates weak autonomy. Every ambiguous requirement is a future interruption where someone has to step in and clarify what “properly” means.
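
One cheap guard against the vague phrasings above is a lint pass over acceptance criteria before the loop starts. The sketch below is a heuristic and not part of Ralph: the word list and the function name `vague_terms` are assumptions chosen for illustration, and a clean result does not prove that a criterion is testable.

```python
import re
from typing import List

# Hypothetical word list: terms that often signal an untestable requirement.
VAGUE_TERMS = ("properly", "faster", "better", "improve", "robust", "user-friendly")


def vague_terms(criterion: str) -> List[str]:
    """Return the vague terms found in a criterion, matched as whole words."""
    lowered = criterion.lower()
    return [t for t in VAGUE_TERMS if re.search(rf"\b{re.escape(t)}\b", lowered)]
```

Run against the examples in this section, "Make the dashboard faster" and "Handle errors properly" are flagged, while the rewritten versions with explicit thresholds and behaviors are not.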

Running it: what to watch while it works

You start the loop with a single command and then nothing. That is the point.

Start the loop

./scripts/ralph/ralph.sh --tool claude 20
./scripts/ralph/ralph.sh

The first command runs the loop with Claude Code and caps it at 20 iterations. The second uses the defaults: Amp as the tool and 10 iterations. The trailing number is the maximum iteration count. Set it to the number of stories plus a small buffer.

Watch these four things in parallel

While the loop runs, these four signals tell you everything you need:

  • Terminal output: Agent activity streams live via tee /dev/stderr. This shows liveness. If it stops streaming, the agent has stalled or exited.
  • Story pass state: cat prd.json | jq '.userStories[] | {id, title, passes}' shows which stories have flipped to true. This is your progress metric.
  • Commit history: git log --oneline -10 shows durable state changes. Each commit represents work the agent considered complete.
  • Accumulated learnings: tail -f progress.txt reveals emerging lessons and decisions the agent is recording as it works.

You do not need to watch every line. You check in periodically and let the signals do the reporting.
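
The story pass state check can also be scripted for periodic check-ins. The sketch below mirrors the jq query above in Python. It assumes prd.json has a top-level `userStories` array whose entries carry `id` and `passes`, as that query implies; the function name `summarize` is hypothetical.

```python
import json
from pathlib import Path


def summarize(prd_path: str = "prd.json") -> dict:
    """Count passed stories and list the IDs that are still pending."""
    prd = json.loads(Path(prd_path).read_text())
    stories = prd["userStories"]
    pending = [s["id"] for s in stories if not s.get("passes", False)]
    return {
        "total": len(stories),
        "passed": len(stories) - len(pending),
        "pending": pending,
    }
```

Calling `summarize()` from the repo root returns a small dictionary you can print or log, which is enough for a glance between other tasks.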

What failure looks like and how to recover

Loops can exit mid-story. Token exhaustion, auth timeouts, network interruptions. This is not a crisis. It is an expected operational state.

Git history is the audit log. Find the last commit where stories were passing. That is the restore point. Ask Claude to clean up any dirty uncommitted changes. Reset to that commit. Restart the loop.

The entire recovery process is: identify the last good state, clean up, restart. No panic required.
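
The recovery steps can be written down as explicit git commands. The sketch below only builds the command list so it can be reviewed before anything runs. It assumes you have already identified the last good commit from `git log`, and it swaps in `git stash push --include-untracked` as one possible way to set dirty changes aside rather than asking the coding tool to clean them up. The function name `recovery_commands` is hypothetical. `git reset --hard` discards working-tree changes, so confirm the commit hash first.

```python
from typing import List


def recovery_commands(last_good_commit: str) -> List[List[str]]:
    """Build (but do not run) the git commands for restarting from a known-good commit."""
    if not last_good_commit.strip():
        raise ValueError("a commit hash or ref is required")
    return [
        # Set aside uncommitted and untracked changes instead of deleting them.
        ["git", "stash", "push", "--include-untracked"],
        # Move the branch and working tree back to the last passing state.
        ["git", "reset", "--hard", last_good_commit],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` once you are satisfied with it, after which the loop can be restarted as usual.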

Review the product, not the code

As Val puts it directly: the skill to develop is not obsessively reading every generated line. It is learning when to trust the loop and when to intervene.

After every 4 to 5 stories, spot-check the running app. If the output looks right, keep going. If architecture feels off, review structurally. Do not review line by line or commit by commit. That reintroduces the same human-in-the-loop bottleneck the autonomous AI coding agent workflow is designed to eliminate.

Trust the harness. That is what it is there for.

Harness engineering: the discipline beneath the magic

Many readers will assume the hard part is model prompting. The hard part is building the environment that can evaluate work reliably. Joe describes this as the harness: the rail system that keeps the agent from operating on vibes.

What the harness does

The harness defines what can be changed, how success is measured, what failure blocks progress, and what gets committed. Without it, an autonomous AI coding agent is just generating code into the void. With it, every loop iteration produces a verifiable outcome.

A practical harness framework

  1. Scope controls. Limit where the agent can work and which files or surfaces matter. Prevent unintended changes to unrelated systems.
  2. Validation layer. Run tests, linters, schema checks, build checks, or fixture comparisons after each implementation step.
  3. State visibility. Store progress in PRD flags, git commits, and progress logs so the current state is always recoverable.
  4. Recovery path. Make rollback and restart straightforward. Git provides this natively when commits are frequent and atomic.
  5. Human review gates. Define when structural review is required. Not after every commit. After meaningful milestones.
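
As an illustration of the first item, scope controls, a harness could refuse to commit when the agent has touched files outside the story's allowed area. This is a sketch under assumptions: the function name `out_of_scope`, the example paths, and the idea of feeding it a list of changed files (for example from `git diff --name-only`) are illustrative and not taken from the repo.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def out_of_scope(changed_files: Iterable[str], allowed_dirs: Iterable[str]) -> List[str]:
    """Return the changed files that are not inside any allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    violations = []
    for name in changed_files:
        path = PurePosixPath(name)
        # A file is in scope if one of its ancestor directories is allowed.
        if not any(a in path.parents for a in allowed):
            violations.append(name)
    return violations
```

Comparing path components rather than raw string prefixes means that `src/signup2/x.ts` is not mistaken for a file inside `src/signup/`.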

Why structural tests beat behavioral tests

Behavioral tests verify broad outcomes. They can be noisy, flaky, and ambiguous. A behavioral test that says “user can complete checkout” depends on many moving parts and can pass or fail for reasons unrelated to the code under test.

Structural tests verify specific outputs: files exist, interfaces match, schemas conform, routes are registered, configs are updated. They give the agent a clear pass/fail signal with minimal ambiguity.

For autonomous loops, structural checks are better because they produce less false confidence. Examples:

  • API contract shape matches the spec
  • Component exists and accepts required props
  • Migration file was created with correct columns
  • Route is registered in the router config
  • Environment variable is referenced in the expected config file
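
A structural check of the first kind, API contract shape, can be very small. The sketch below assumes the spec is expressed as a mapping from field name to expected Python type. That representation and the name `contract_violations` are hypothetical, and a real project may prefer a dedicated schema library.

```python
from typing import Any, Dict, List


def contract_violations(payload: Dict[str, Any], spec: Dict[str, type]) -> List[str]:
    """Compare a response payload against expected field names and types."""
    violations = []
    for field, expected in spec.items():
        if field not in payload:
            violations.append(f"missing: {field}")
        # Note: isinstance treats bool as a subclass of int.
        elif not isinstance(payload[field], expected):
            violations.append(f"wrong type: {field}")
    for field in payload:
        if field not in spec:
            violations.append(f"unexpected: {field}")
    return violations
```

An empty list is the pass signal. Anything else is a specific, machine-readable reason the story has not passed yet.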

When not to use Ralph

Consider a tangled legacy codebase with unclear ownership, no reliable test suite, and requirements scattered across Slack threads. Running an autonomous loop against that project is a recipe for compounding confusion, not shipping features.

Greenfield vs brownfield

Greenfield projects are cleaner for an autonomous AI coding agent because scope, structure, and tests can be designed upfront. You control the architecture. You define the patterns. The agent follows them.

Brownfield projects introduce hidden dependencies, inconsistent patterns, and undocumented constraints. Ralph is not useless in brownfield contexts. But it becomes more dependent on strong harnessing, narrower story scope, and more frequent human checkpoints.

Where Ralph fits and where it struggles

| Scenario | Fit | Recommended human involvement |
| --- | --- | --- |
| New feature in a greenfield app | Strong. Clean scope, fresh architecture. | Review after every 4-5 stories. |
| CRUD extension with clear tests | Strong. Repetitive, well-bounded work. | Light monitoring. |
| UI polish with explicit component requirements | Good. Testable visual contracts. | Spot-check rendered output. |
| Legacy monolith refactor | Weak. Hidden dependencies, no reliable tests. | Heavy. Human-led scoping per story. |
| Cross-service architecture redesign | Weak. Requires judgment across system boundaries. | Human-driven. Agent assists locally. |
| Security-sensitive production change | Poor fit without guardrails. | Full human review required. |
| Highly ambiguous product exploration | Poor fit. No clear pass/fail criteria. | Human-led discovery first. |

AI agent vs human developer: what changes

Efficiency: Agents can be faster on scoped, repetitive, testable implementation. They do not get tired. They do not context-switch. They execute the loop at machine speed.

Creativity: Humans still lead on ambiguous product judgment, architecture invention, and tradeoff calls. Agents execute within constraints. They do not question whether the constraints are right.

Error rates: Agents may produce more silent or plausible-looking mistakes unless constrained by tests. A human developer notices when something “feels wrong” architecturally. An agent does not. The harness compensates for this, but only if it is well-built.

This is not a replacement story. It is a division of labor. Humans scope, design, and review. The autonomous AI coding agent implements, tests, and commits.

Signs you should not run it overnight

  • No reliable test suite to validate against
  • Vague or unwritten requirements
  • Unclear architecture boundaries
  • Regulated or security-critical changes
  • Dependencies on stakeholder interpretation mid-build

Where to go next

Everything in this article came out of a single live build session. Here is what Val and Joe built, and everything they used to build it.

Start with the repo

GitHub repo: https://github.com/udacity/agentic-loop-webinar

This is the fastest on-ramp to a working autonomous AI coding agent setup. It includes ralph.sh, PRD examples, the CLAUDE.md prompt, and starter files. Clone it and you have a working harness in minutes.

Watch the full live build

Webinar recording: https://www.youtube.com/watch?v=S1VWe_JvpHo&t=1s

Watch the loop run live. See stories flipping to passes: true in real time. This makes the workflow concrete in a way that reading about it cannot.

Use the field guide as your reference sheet

The Ralph Loops: A Field Guide PDF is a concise reference covering anatomy, quick start, PRD writing, safety, and anti-patterns. Keep it open while you run your first loop.

Reframe the scale question

The common question is “Can this handle my project?” As Joe reframes it, the answer depends less on the scale of the problem than on how well the problem can be broken into appropriately scoped stories. A large project broken into 40 well-written stories is a better fit than a small project with 3 vague ones.

Keep building deeper agentic skills

Ralph loops run on top of a broader set of agentic patterns. The Agentic AI Nanodegree program covers the full stack: workflow patterns, tool integration, state and memory design, and multi-agent orchestration.

Conclusion

The real power of an autonomous AI coding agent comes from scoped stories, machine-checkable criteria, and a reliable harness. Not from a magical model. Not from a longer context window.

Building this workflow is less about finding the right AI and more about designing a system that can work without your memory. Writing good acceptance criteria is now a core technical skill. Story decomposition, structural testing, and harness engineering are where the leverage sits.

Spec discipline is the unlock, not the model.

In the AI economy, the professionals who stand out will not just know how to prompt. They will know how to structure work so AI systems can execute it reliably, from experiment to production. The Agentic AI Nanodegree program is built to develop exactly those skills.

Valerie Scarlata
Valerie “Val” Scarlata is the AI Curriculum Lead at Udacity. Before shifting focus to AI, Val spent two decades teaching web and mobile development and was a Software Engineer at Hello World and an Organizer for Google Developers Group prior to joining Udacity in 2021.