How to Build an AI Agent: Step-by-Step with Python

Last Updated on June 29, 2026

You write a Python script that calls an LLM. You give it a system prompt, feed in a user question, and get a response back. It looks like an agent. It sounds like an agent. Then someone asks it to check whether an order shipped, and it confidently makes up a tracking number.

Most tutorials on building AI agents stop right there. They show you a single model call wrapped in a function, maybe with a prompt template, and call it an agent. That is a chatbot. A useful one, sometimes. But not an agent.

The confusion comes from the term itself. “Agent” gets used loosely across documentation, blog posts, and product marketing. It can mean anything from a customer support bot to an autonomous research system. That vagueness is where most learners get stuck.

Here is the clearer frame: if you want to understand how to build an AI agent, focus on architecture, not labels. A real agent is not just an LLM call. It needs three parts working together:

A reasoning loop that plans, acts, observes results, and revises
Tool use that lets the system access live data or take real actions
State management that tracks what happened, what is next, and when to stop

Model choice matters. Framework choice matters. But architecture matters more. This article walks through all three layers, then combines them into a minimal working agent in Python.

What separates an agent from a chatbot

A chatbot handles one request in one turn. You send a prompt, you get a response. That is it. The system does not check its own answer, look up real data, or decide it needs more information before responding.

An agent does something structurally different. It can plan multiple steps, call external tools, evaluate results, and adjust its approach before producing a final answer. The difference is not “more intelligence” in some abstract sense. It is system design.

A useful mental model for building AI agents comes down to three layers:

Reasoning loop: lets the system plan, act, observe results, and revise its approach
Tool use: lets the system access information or take actions outside the model’s internal knowledge
State management: lets the system track what happened, what is next, and when to stop

A single LLM call is not enough to count as a real agent in most practical settings. Without these three layers, you are building a prompt wrapper, not an agent.

To see this come together, it helps to watch one get built. In a recent live session, Udacity instructor Peter Kowalchuk wrote an agent from scratch in Python, starting with a single model call and adding one capability at a time until it became a working agent. This guide refers back to that build throughout.

The three-layer definition at a glance

All three layers need to work together. A reasoning loop without tools has no way to verify anything. Tools without state management have no way to track what already ran. A state machine without a reasoning loop is just a hardcoded script.

Remove any one layer and the system collapses back into a scripted assistant or a plain chatbot.

A minimal agent stack: Reasoning loop + Tool use + State management

Peter’s build is a good way to see why these layers exist. He starts with the simplest possible thing — one call to a language model — and shows it has no scope, no continuity, and no way to act. He then adds a persona to give it a role, knowledge to ground it, memory so it carries context between turns, and tools so it can do something — and only then hands it a goal rather than a single question. The three layers below are the architecture underneath that progression.

Layer 1, the reasoning loop: plan, act, observe, revise

What happens when an agent gets a question it cannot answer from internal knowledge alone? Or when the first tool call returns an error? Without a reasoning loop, the system either guesses or fails silently.

The reasoning loop is what turns static text generation into behavior over time. It gives the agent a process for working through multi-step tasks:

Plan: Decide what to do next based on the current goal and available information
Act: Call a tool or produce an intermediate action
Observe: Inspect the result of that action
Revise: Update the plan based on what was learned, then decide the next step

This is not just “feedback.” It is an execution loop with branching decisions. The system might change direction entirely based on what it observes. It might call a different tool. It might determine it has enough information to stop.

This pattern matters for tasks like researching a vendor, comparing product specs across sources, or checking a database before drafting a response. Any task that requires more than one step to complete reliably needs some form of this loop.

“Agentic RAG isn’t just about tools, it’s about behavior.” Retrieval alone is not the point. The system’s value comes from how it decides when to retrieve, what to do with the result, and whether to continue or stop.

Example of the reasoning loop in a real task

Consider a Python-based support agent that needs to answer: “Is order #4821 delayed, and what should I tell the customer?”

Plan: Look up order status in the order management system
Act: Query the order API with order ID #4821
Observe: Status returns “delayed,” but no estimated arrival date is included
Revise: The answer is incomplete. Call the shipping carrier API for transit details, then check the internal knowledge base for the company’s delay communication policy
Final response: Draft a message to the customer explaining the delay with an accurate next step and expected resolution window

Without the reasoning loop, the agent would have stopped after the first API call and either reported “delayed” with no context or hallucinated an arrival date.
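
The trace above can be sketched in a few lines of Python. The three lookup functions (get_order_status, get_carrier_eta, get_delay_policy) and their return values are hypothetical stand-ins, not real APIs. In a real agent the model would make the revise decision; here a plain if statement plays that role so the branching is easy to see.

```python
# Hypothetical stubs standing in for the order system, carrier API, and knowledge base.
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delayed", "eta": None}

def get_carrier_eta(order_id: str) -> str:
    return "in transit, expected within 3 days"  # made-up value for illustration

def get_delay_policy() -> str:
    return "apologize and share the carrier's latest estimate"  # made-up policy

def answer_order_question(order_id: str) -> dict:
    trace = []  # ordered record of which tools were called

    # Plan + Act: start with the order system.
    order = get_order_status(order_id)
    trace.append("get_order_status")

    # Observe + Revise: a delayed order with no ETA is an incomplete answer,
    # so the plan grows by two more lookups.
    transit, policy = None, None
    if order["status"] == "delayed" and order["eta"] is None:
        transit = get_carrier_eta(order_id)
        trace.append("get_carrier_eta")
        policy = get_delay_policy()
        trace.append("get_delay_policy")

    return {"order": order, "transit": transit, "policy": policy, "trace": trace}

result = answer_order_question("4821")
print(result["trace"])
```

The second and third calls happen only because the first result was inspected and found incomplete. That dependency on observed output is what separates a loop from a fixed sequence.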

Common mistake to call out

Many tutorials call a chain of prompts a “reasoning loop.” That is misleading. A real loop reacts to results and can change course. If nothing is observed and revised, if the system just executes a fixed sequence of prompts regardless of output, it is closer to a scripted workflow than a robust agent.

Layer 2, tool use: giving the agent access to the real world

An agent that cannot look anything up or take any action is limited to whatever the model learned during training. That means no current prices, no live order statuses, no real calculations, no ability to create a ticket or send an update.

“Let’s talk about one of the most exciting things you can do with large language models, giving them superpowers.” The framing of “superpowers” is practical, not hype. The model becomes useful because it can do more than autocomplete text. It can gather evidence, verify facts, and take actions in external systems.

Building AI agents requires tool access because:

Models are limited by their training data cutoff
Models can hallucinate facts, numbers, and references
Real work depends on external systems like databases, APIs, and business tools

There is a meaningful difference between generating an answer and gathering evidence. Tools move the agent from guessing to grounding.

Tool examples for a minimal Python agent

Tool type	What it does	Example Python use	Why it matters for building AI agents	Tradeoff or risk
Web search API	Fetches current information	Search recent docs, prices, or headlines	Lets the agent work with live data	Search quality varies by provider
Database query	Retrieves structured records	Look up orders, users, or inventory	Grounds answers in system data	Needs access control and query limits
Calculator/function	Performs reliable computation	Tax, totals, date math	Better than asking the model to estimate	Narrow capability per function
Internal API	Takes real actions	Create a ticket or send an update	Moves from advice to execution	Requires safeguards and permissions
Code execution tool	Runs snippets or transforms data	Parse CSV, summarize logs	Useful for data-heavy tasks	Sandbox and security concerns

When tool use changes the design

Once tools are involved, the system is no longer “just prompting.” Tools introduce software engineering concerns:

Structured inputs: Tools need well-defined parameters, not freeform text
Output validation: Tool responses must be checked before the agent acts on them
Retry logic: APIs fail. The agent needs to handle timeouts and errors
Access controls: Not every tool should be available to every agent or every user

This is the point where building AI agents starts to feel less like prompt engineering and more like systems engineering. That shift is intentional.
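
As a rough illustration of three of those concerns (structured inputs, output validation, and retry logic), here is a minimal sketch of a wrapper around a tool call. The name call_with_retry, its parameters, and the flaky_lookup tool are assumptions made for this example. Access controls are left out to keep it short.

```python
import time
from typing import Callable

def call_with_retry(tool: Callable[[str], str], tool_input: str,
                    retries: int = 3, base_delay: float = 0.1) -> str:
    """Call a tool, retrying on exceptions with exponential backoff."""
    # Structured input check: reject anything that is not a non-empty string.
    if not isinstance(tool_input, str) or not tool_input.strip():
        raise ValueError("tool_input must be a non-empty string")

    last_error = None
    for attempt in range(retries):
        try:
            output = tool(tool_input)
            # Output validation: an empty or non-string result counts as a failure.
            if not isinstance(output, str) or not output:
                raise ValueError("tool returned empty or non-string output")
            return output
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError(f"tool failed after {retries} attempts: {last_error}")

# Hypothetical flaky tool: fails twice, then succeeds.
calls = {"count": 0}

def flaky_lookup(query: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return f"result for {query}"

print(call_with_retry(flaky_lookup, "order 4821", base_delay=0.01))
```

Retrying on every exception is a simplification. A production wrapper would usually retry only on transient errors such as timeouts and fail fast on everything else.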

Layer 3, state management: why LLMs are stateless and agents cannot be

State and memory sound similar. They are not the same thing, and confusing them is one of the most common sticking points in building AI agents.

“LLMs are stateless, but agents need state to manage execution.” Each call to a language model is independent. The model does not remember what it said last time unless you explicitly pass prior context into the prompt. An agent, by contrast, needs to know where it is in a multi-step process.

Agents need state to track:

The current task and user query
Which steps have been completed
Tool outputs collected so far
Pending actions
Stop conditions (max steps, success criteria, error thresholds)

State vs memory

These terms get used interchangeably, but they describe different things:

State = execution status for the current run. Example: “waiting for search result”
Memory = stored information that may persist across runs. Example: “user prefers concise summaries”

State is about the current task. Memory is about accumulated knowledge. Most agents need state. Not all agents need memory.
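
One way to make the distinction concrete is to give each its own object. In the sketch below, RunState and Memory are illustrative names, and storing memory as a JSON file is an assumption chosen for simplicity; a real system might use a database or a vector store.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class RunState:
    """Execution status for one run; discarded when the run ends."""
    task: str
    status: str = "waiting for search result"
    completed_steps: list = field(default_factory=list)

class Memory:
    """Facts that persist across runs, stored here as a JSON file (an assumption)."""
    def __init__(self, path: Path):
        self.path = path

    def load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key: str, value: str) -> None:
        data = self.load()
        data[key] = value
        self.path.write_text(json.dumps(data))

memory_file = Path(tempfile.mkdtemp()) / "memory.json"

# Run 1: fresh state, and one preference is written to memory.
run1 = RunState(task="summarize vendor report")
Memory(memory_file).remember("summary_style", "concise")

# Run 2: state starts empty again, but memory still holds the preference.
run2 = RunState(task="summarize Q3 report")
print(run2.completed_steps, Memory(memory_file).load())
```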

Simple state machine for a minimal agent

A state machine defines where the agent can be, what events move it forward, and when it stops. Here is a minimal set of states:

IDLE → PLANNING → CALLING_TOOL → EVALUATING_RESULT → RESPONDING → DONE
                       ↓                  ↓
                     ERROR          ERROR

IDLE: Waiting for a task
PLANNING: Deciding the next action
CALLING_TOOL: Executing a tool function
EVALUATING_RESULT: Inspecting the tool’s output
RESPONDING: Generating the final answer
DONE: Task complete
ERROR: Timeout, invalid tool output, or max steps reached

What moves the agent between states? Events like “tool returned a result,” “result is insufficient, re-plan,” or “max steps exceeded, stop.” Timeouts, invalid outputs, and step limits push the agent to ERROR or DONE to prevent infinite loops.
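
The states and events above can be written down as an explicit transition table. This is a sketch rather than a canonical design: the edges out of PLANNING to RESPONDING and ERROR are assumptions added so that the agent can answer without a tool call or stop when a step limit is hit.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    PLANNING = auto()
    CALLING_TOOL = auto()
    EVALUATING_RESULT = auto()
    RESPONDING = auto()
    DONE = auto()
    ERROR = auto()

# Allowed transitions. EVALUATING_RESULT -> PLANNING is the re-plan path.
# PLANNING -> RESPONDING and PLANNING -> ERROR are assumptions (see lead-in).
TRANSITIONS = {
    State.IDLE: {State.PLANNING},
    State.PLANNING: {State.CALLING_TOOL, State.RESPONDING, State.ERROR},
    State.CALLING_TOOL: {State.EVALUATING_RESULT, State.ERROR},
    State.EVALUATING_RESULT: {State.PLANNING, State.RESPONDING, State.ERROR},
    State.RESPONDING: {State.DONE},
    State.DONE: set(),   # terminal
    State.ERROR: set(),  # terminal
}

def transition(current: State, target: State) -> State:
    """Return the target state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target

# Walk one run that re-plans once before responding.
path = [State.PLANNING, State.CALLING_TOOL, State.EVALUATING_RESULT,
        State.PLANNING, State.CALLING_TOOL, State.EVALUATING_RESULT,
        State.RESPONDING, State.DONE]
state = State.IDLE
for target in path:
    state = transition(state, target)
print(state.name)
```

Because DONE and ERROR have no outgoing edges, any attempt to keep going after a stop condition fails loudly instead of looping.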

Putting it together: a minimal working agent in Python

This section combines all three layers into one small but functional example. The goal is a research assistant agent that answers a question using a web search tool and a calculator.

The example below is deliberately minimal. If you want a complete, runnable version that follows the same progression — model call, persona, memory, tools, and a multi-agent workflow — Peter’s full code from the live build is public on GitHub: github.com/udacity/agentic-ai-from-scratch-webinar.

Step 1: define the task and the agent’s boundaries

Strong agents start with narrow scope. This agent can:

Search the web for information
Perform basic calculations
Synthesize results into a final answer

It cannot send emails, modify databases, or take any other real-world action. It has a maximum of five reasoning steps before it must return whatever it has.

Step 2: define the tools

Tools are Python functions with clear input/output contracts.

def search_web(query: str) -> str:
    """Search the web and return a summary of results."""
    # In production, this wraps a real search API with auth and error handling
    # Simulated for this example
    if "population" in query.lower():
        return "According to recent data, Tokyo has approximately 14 million residents in the city proper."
    return f"No strong results found for: {query}"

def calculate(expression: str) -> str:
    """Evaluate a math expression and return the result."""
    try:
        result = eval(expression)  # Use a safe parser in production
        return str(result)
    except Exception as e:
        return f"Calculation error: {e}"

In production, these would wrap real APIs and need authentication, error handling, rate limiting, and logging. The structure matters more than the implementation details here.
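
The eval call in calculate is the one line that should not survive past a demo, because it will run any Python expression the model produces. One possible replacement, sketched here under the assumption that the agent only needs basic arithmetic, walks the expression's syntax tree with Python's standard ast module and rejects anything else. The name safe_calculate is illustrative.

```python
import ast
import operator

# Only these operators are permitted; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def _evaluate(node: ast.AST):
    """Recursively evaluate a node made only of numbers and allowed operators."""
    if isinstance(node, ast.Constant):
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")

def safe_calculate(expression: str) -> str:
    """Evaluate basic arithmetic without eval; same return contract as calculate."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_evaluate(tree.body))
    except Exception as e:
        return f"Calculation error: {e}"

print(safe_calculate("2 * (3 + 4)"))
print(safe_calculate("__import__('os').getcwd()"))
```

The first call prints 14. The second returns a calculation error instead of executing the import, because function calls are not in the allowed node types.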

Step 3: create a simple state object

This is the piece many beginner tutorials leave out. Without explicit state, the agent has no way to track its own progress.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentState:
    user_query: str
    current_step: int = 0
    max_steps: int = 5
    observations: list = field(default_factory=list)
    selected_tool: Optional[str] = None
    final_answer: Optional[str] = None
    status: str = "PLANNING"  # PLANNING, CALLING_TOOL, EVALUATING_RESULT, RESPONDING, DONE, or ERROR

Step 4: implement the reasoning loop

The loop decides whether to call a tool, continue reasoning, or stop. It uses a lightweight decision schema from the model.

import json  # not used by the simulated logic below; a real model response would be parsed with it

TOOLS = {"search_web": search_web, "calculate": calculate}

def get_model_decision(state: AgentState) -> dict:
    """Ask the LLM what to do next. Returns a structured decision."""
    # In production, this calls your LLM API with the current state
    # The prompt includes the user query, observations so far, and available tools
    # Expected response format:
    # {"action": "call_tool" | "final_answer",
    #  "tool_name": "search_web" | "calculate",
    #  "tool_input": "...",
    #  "reasoning": "..."}
    
    # Simulated decision logic for this example
    if not state.observations:
        return {
            "action": "call_tool",
            "tool_name": "search_web",
            "tool_input": state.user_query,
            "reasoning": "No information gathered yet. Searching for relevant data."
        }
    return {
        "action": "final_answer",
        "tool_name": None,
        "tool_input": None,
        "reasoning": "Sufficient information gathered to answer."
    }
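
When get_model_decision calls a real model, the response arrives as text, and the text is not guaranteed to match the expected format. A minimal sketch of the parsing step might look like this. The function name parse_decision and the two allow-lists are assumptions that mirror the schema in the comments above.

```python
import json

ALLOWED_ACTIONS = {"call_tool", "final_answer"}
ALLOWED_TOOLS = {"search_web", "calculate"}

def parse_decision(raw: str) -> dict:
    """Parse the model's raw text into a decision dict, or raise ValueError."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc

    if not isinstance(decision, dict):
        raise ValueError("decision must be a JSON object")
    if decision.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {decision.get('action')!r}")
    if decision["action"] == "call_tool":
        # Tool calls need a known tool and a string input.
        if decision.get("tool_name") not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {decision.get('tool_name')!r}")
        if not isinstance(decision.get("tool_input"), str):
            raise ValueError("tool_input must be a string")
    return decision

# Simulated raw model output for illustration.
raw_response = (
    '{"action": "call_tool", "tool_name": "search_web", '
    '"tool_input": "Tokyo population", "reasoning": "Need data."}'
)
print(parse_decision(raw_response)["tool_name"])
```

Raising ValueError gives the loop a single failure type to catch, so a malformed decision can move the agent to ERROR or trigger a retry rather than crash the run.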

Step 5: handle tool results and update state

Tool calls do not always succeed. The agent needs to handle failures gracefully and update state regardless of outcome.

def execute_tool(state: AgentState, decision: dict) -> str:
    """Run the selected tool and return its output."""
    tool_name = decision["tool_name"]
    tool_input = decision["tool_input"]
    
    if tool_name not in TOOLS:
        return f"Error: Unknown tool '{tool_name}'"
    
    try:
        result = TOOLS[tool_name](tool_input)
        return result
    except Exception as e:
        return f"Tool execution failed: {e}"

Step 6: return a final answer

The finish point is explicit. The agent generates a final response grounded in its observations, not from the model’s internal knowledge alone.

def generate_final_answer(state: AgentState) -> str:
    """Produce the final answer based on collected observations."""
    # In production, pass observations to the LLM for synthesis
    obs_text = "\n".join(state.observations)
    return f"Based on research: {obs_text}"

Minimal Python example, full flow

def run_agent(user_query: str) -> str:
    """Run the agent loop: plan, act, observe, revise."""
    state = AgentState(user_query=user_query)
    
    while state.status not in ("DONE", "ERROR"):
        # Guard against infinite loops
        if state.current_step >= state.max_steps:
            state.status = "ERROR"
            return "Max steps reached. Partial observations: " + str(state.observations)
        
        # PLAN: ask the model what to do
        state.status = "PLANNING"
        decision = get_model_decision(state)
        
        if decision["action"] == "final_answer":
            # RESPOND: generate and return the answer
            state.status = "DONE"
            state.final_answer = generate_final_answer(state)
            return state.final_answer
        
        # ACT: call the selected tool
        state.status = "CALLING_TOOL"
        state.selected_tool = decision["tool_name"]
        result = execute_tool(state, decision)
        
        # OBSERVE: record the result
        state.observations.append(f"[{decision['tool_name']}] {result}")
        state.current_step += 1
        
        # REVISE: loop back to planning with updated state
        state.status = "PLANNING"
    
    return state.final_answer or "Agent completed without producing an answer."

# Example invocation
answer = run_agent("What is the population of Tokyo?")
print(answer)

This code is not production-grade. It is designed to make the architecture visible. Every component maps to one of the three layers: the while loop is the reasoning loop, TOOLS is the tool layer, and AgentState is the state management layer.

What this minimal agent still lacks

Here is an honest accounting of what would need to change for production use:

Authentication and secrets management for real API keys
Schema validation on tool inputs and outputs
Monitoring and tracing to debug failures across steps
Retries with backoff for flaky tool calls
Human approval gates for high-impact actions
Evaluation benchmarks to measure agent accuracy over time
Cost monitoring to track LLM API spend per run

These are engineering concerns, not AI concerns. That distinction matters for career planning.
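
To show how small some of these additions can be, here is a sketch of one item from the list, a human approval gate. The tool names in HIGH_IMPACT_TOOLS and the gated_call helper are hypothetical. The approver is passed in as a function so that a test can stub it and a real deployment can prompt a person.

```python
from typing import Callable

HIGH_IMPACT_TOOLS = {"send_refund", "delete_record"}  # hypothetical tool names

def gated_call(tool_name: str, tool: Callable[[str], str], tool_input: str,
               approve: Callable[[str, str], bool]) -> str:
    """Run a tool, asking for approval first if it is marked high-impact."""
    if tool_name in HIGH_IMPACT_TOOLS and not approve(tool_name, tool_input):
        return f"Blocked: '{tool_name}' was not approved"
    return tool(tool_input)

# Stand-in approver that rejects everything; a real one would ask a person.
def deny_all(tool_name: str, tool_input: str) -> bool:
    return False

print(gated_call("send_refund", lambda x: f"refunded {x}", "order 4821", deny_all))
print(gated_call("lookup", lambda x: f"found {x}", "order 4821", deny_all))
```

The refund is blocked and the lookup runs, because only tools on the high-impact list go through the approver.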

Design choices that matter more than model choice

Most failures in early agent systems come from architecture problems, not model limitations:

Poor tool design with ambiguous inputs or outputs
Missing state logic that lets the agent lose track of progress
Unclear stop conditions that cause infinite loops or premature exits
Weak error handling that silently swallows tool failures

A few architecture tradeoffs worth considering:

One general tool vs many narrow tools: Narrow tools are easier to test and validate. General tools are more flexible but harder to make reliable.
Short loop with tight constraints vs open-ended exploration: Tight constraints are easier to debug and cheaper to run. Open-ended loops handle more complex tasks but need stronger guardrails.
Explicit state machine vs ad hoc state in prompt text: State machines are more predictable. Prompt-based state is more flexible but harder to inspect and test.

The direct recommendation: start with the simplest architecture that can complete the task reliably. Add complexity only when you have evidence that the simple version fails.

It’s also worth building an agent from scratch at least once. Peter’s session skipped frameworks like LangChain on purpose — not because they’re wrong, but because writing the loop by hand shows you exactly what those frameworks abstract away. That’s what lets you judge when a framework is helping and when it’s hiding a problem.

When to go multi-agent, and when not to

Think of a multi-agent system like a restaurant: hosts greet customers, waiters take orders, chefs cook, and managers oversee operations. Each agent has a defined role. Coordination happens through clear handoffs, not one system trying to do everything.

Multi-agent patterns help when tasks naturally decompose into distinct roles:

One agent retrieves and gathers data
One evaluates quality or checks for errors
One drafts the final output
One routes, supervises, or decides escalation

But multi-agent is not the default answer. It adds orchestration complexity, more failure points, and harder evaluation.

Single agent vs multi-agent comparison

Approach	Best for	Strengths	Tradeoffs	Recommendation
Single agent	Narrow workflows, early prototypes, well-bounded tasks	Simpler design, easier debugging, lower orchestration overhead	Can become overloaded on complex workflows	Start here by default
Multi-agent system	Complex workflows with distinct roles or parallel work	Specialization, modularity, clearer role separation	More orchestration, more failure points, harder evaluation	Use only when role separation clearly improves outcomes

A good rule of thumb

If one agent with good tools and clear state can complete the job, keep it single-agent. Add multiple agents only when responsibilities are clearly separable, specialization measurably improves quality, and the orchestration cost is worth the outcome.

As Peter framed it in the build: you wouldn’t move to a new city and hire one person who knows every neighborhood, every house, and every kind of buyer — you’d talk to a few specialists. Multi-agent systems work the same way. Split responsibilities the way you’d staff a team, and only when the goal actually calls for it.
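
A role split like the one described above can be sketched with plain functions standing in for agents. Everything here is illustrative: the Handoff record, the role names, and the placeholder quality check are assumptions, and in a real system each role would run its own reasoning loop with its own tools.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """The shared record passed from one role to the next."""
    question: str
    evidence: list = field(default_factory=list)
    approved: bool = False
    draft: str = ""

def retriever(h: Handoff) -> Handoff:
    # Stand-in for an agent that gathers data with search tools.
    h.evidence.append("Tokyo has approximately 14 million residents in the city proper.")
    return h

def evaluator(h: Handoff) -> Handoff:
    # Placeholder quality check: approve only if some evidence exists.
    h.approved = len(h.evidence) > 0
    return h

def drafter(h: Handoff) -> Handoff:
    h.draft = "Based on research: " + " ".join(h.evidence)
    return h

def supervisor(question: str) -> str:
    """Route the work through the roles and decide whether to escalate."""
    h = evaluator(retriever(Handoff(question)))
    if not h.approved:
        return "Escalated: no usable evidence"
    return drafter(h).draft

print(supervisor("What is the population of Tokyo?"))
```

Even in this toy version the cost is visible: four components and a shared record to keep consistent, where the single-agent example needed one loop.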

Real-world deployment lessons: what breaks first

Early failures in AI agent deployments usually come from systems issues, not model intelligence. The model can reason well enough. The surrounding system falls apart.

Common failure patterns:

Tool failures: APIs time out, return unexpected formats, or hit rate limits
Bad permissions: The agent tries to access a resource it is not authorized to reach
Infinite loops: Missing stop conditions or poorly defined success criteria
Weak monitoring: No visibility into what the agent did or why it made a decision
Ambiguous task definitions: The agent cannot determine when it has answered the question

Moving from experiment to production means treating agents like software systems, not just prompts. Teams need logging and traces, retry strategies, guardrails on outputs, human fallback paths for unclear cases, and cost monitoring per run.

Deployment pattern examples

Support triage agent: Routes incoming tickets to the correct team based on content analysis. Uses a classification tool and a knowledge base lookup. Tracks ticket state through RECEIVED → CLASSIFIED → ROUTED → CONFIRMED. Guardrails: escalates to a human when confidence is below threshold. Logs every routing decision for audit.

Competitor research agent: Gathers product information from public sources, compares specs, and drafts a summary. Uses web search and a structured data parser. Tracks which competitors have been researched and which comparisons are complete. Guardrails: flags any claim that cannot be traced to a source URL. Requires human review before the summary is shared externally.
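
The support triage pattern can be sketched as follows. The classifier is a hypothetical keyword stub, and the 0.8 threshold and team names are made-up values. The sketch stops at ROUTED because confirmation would come from the receiving team, outside this function.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("triage")

CONFIDENCE_THRESHOLD = 0.8  # assumed value; tune against real routing data

def classify(ticket_text: str) -> tuple:
    """Hypothetical classifier: returns (team, confidence)."""
    if "refund" in ticket_text.lower():
        return ("billing", 0.95)
    return ("general", 0.4)

def triage(ticket_id: str, ticket_text: str) -> dict:
    record = {"id": ticket_id, "state": "RECEIVED", "team": None}

    team, confidence = classify(ticket_text)
    record["state"] = "CLASSIFIED"

    # Guardrail: low-confidence classifications go to a person.
    record["team"] = team if confidence >= CONFIDENCE_THRESHOLD else "human_review"
    record["state"] = "ROUTED"

    # Audit log: one line per routing decision.
    log.info("ticket=%s team=%s confidence=%.2f", ticket_id, record["team"], confidence)
    return record

print(triage("T-1", "I want a refund for my order"))
print(triage("T-2", "Something seems off with my account"))
```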

These are the kinds of skills that matter in the AI job landscape. Not just prompting a model, but designing systems that handle real operational requirements.

Future trends in building AI agents

Progress in agentic AI is likely to come from better orchestration, observability, and evaluation rather than just larger models. Several directions are gaining traction:

Better tool routing: Models are improving at selecting the right tool and structuring inputs correctly
Stronger structured outputs: More reliable JSON and schema-conformant responses reduce parsing errors
More reliable evaluation: New benchmarks and frameworks for measuring agent performance on multi-step tasks

The professionals who can move systems from experiment to production will be valuable to employers looking to operationalize AI at scale. That means understanding not just how to call a model, but how to design the system around it.

Conclusion

If you want to understand how to build an AI agent, do not start with hype or framework comparisons. Start with the three layers: a reasoning loop that can plan and revise, tools that connect the agent to real data and actions, and state management that tracks execution from start to finish.

Even a minimal Python agent built on these principles teaches the right architecture. The code in this article is deliberately simple. The patterns it demonstrates are the same ones that scale into production systems.

Building AI agents with Python is a skill that combines software engineering, system design, and applied AI. It is not about knowing one framework or one model. It is about understanding how to make them work together reliably.

Ready to go deeper?

If you want structured practice moving from agent concepts to working systems, and to learn more from Peter Kowalchuk, the Agentic AI program is the logical next step. It covers building practical agent workflows, working with tools and orchestration, and developing the production-relevant judgment that separates demos from deployed systems. You can also explore the full catalog for programs across AI, data, and software engineering.
