
How to Build an AI Agent: Step-by-Step with Python

You write a Python script that calls an LLM. You give it a system prompt, feed in a user question, and get a response back. It looks like an agent. It sounds like an agent. Then someone asks it to check whether an order shipped, and it confidently makes up a tracking number.

Most tutorials on building AI agents stop right there. They show you a single model call wrapped in a function, maybe with a prompt template, and call it an agent. That is a chatbot. A useful one, sometimes. But not an agent.

The confusion comes from the term itself. “Agent” gets used loosely across documentation, blog posts, and product marketing. It can mean anything from a customer support bot to an autonomous research system. That vagueness is where most learners get stuck.

Here is the clearer frame: if you want to understand how to build an AI agent, focus on architecture, not labels. A real agent is not just an LLM call. It needs three parts working together:

  • A reasoning loop that plans, acts, observes results, and revises
  • Tool use that lets the system access live data or take real actions
  • State management that tracks what happened, what is next, and when to stop

Model choice matters. Framework choice matters. But architecture matters more. This article walks through all three layers, then combines them into a minimal working agent in Python.

What separates an agent from a chatbot

A chatbot handles one request in one turn. You send a prompt, you get a response. That is it. The system does not check its own answer, look up real data, or decide it needs more information before responding.

An agent does something structurally different. It can plan multiple steps, call external tools, evaluate results, and adjust its approach before producing a final answer. The difference is not “more intelligence” in some abstract sense. It is system design.

A useful mental model for building AI agents comes down to three layers:

  • Reasoning loop: lets the system plan, act, observe results, and revise its approach
  • Tool use: lets the system access information or take actions outside the model’s internal knowledge
  • State management: lets the system track what happened, what is next, and when to stop

A single LLM call is not enough to count as a real agent in most practical settings. Without these three layers, you are building a prompt wrapper, not an agent.

The three-layer definition at a glance

All three layers need to work together. A reasoning loop without tools has no way to verify anything. Tools without state management have no way to track what already ran. A state machine without a reasoning loop is just a hardcoded script.

Remove any one layer and the system collapses back into a scripted assistant or a plain chatbot.

A minimal agent stack: Reasoning loop + Tool use + State management

Layer 1, the reasoning loop: plan, act, observe, revise

What happens when an agent gets a question it cannot answer from internal knowledge alone? Or when the first tool call returns an error? Without a reasoning loop, the system either guesses or fails silently.

The reasoning loop is what turns static text generation into behavior over time. It gives the agent a process for working through multi-step tasks:

  1. Plan: Decide what to do next based on the current goal and available information
  2. Act: Call a tool or produce an intermediate action
  3. Observe: Inspect the result of that action
  4. Revise: Update the plan based on what was learned, then decide the next step

This is not just “feedback.” It is an execution loop with branching decisions. The system might change direction entirely based on what it observes. It might call a different tool. It might determine it has enough information to stop.

This pattern matters for tasks like researching a vendor, comparing product specs across sources, or checking a database before drafting a response. Any task that requires more than one step to complete reliably needs some form of this loop.

“Agentic RAG isn’t just about tools, it’s about behavior.” Retrieval alone is not the point. The system’s value comes from how it decides when to retrieve, what to do with the result, and whether to continue or stop.

Example of the reasoning loop in a real task

Consider a Python-based support agent that needs to answer: “Is order #4821 delayed, and what should I tell the customer?”

  • Plan: Look up order status in the order management system
  • Act: Query the order API with order ID #4821
  • Observe: Status returns “delayed,” but no estimated arrival date is included
  • Revise: The answer is incomplete. Call the shipping carrier API for transit details, then check the internal knowledge base for the company’s delay communication policy
  • Final response: Draft a message to the customer explaining the delay with an accurate next step and expected resolution window

Without the reasoning loop, the agent would have stopped after the first API call and either reported “delayed” with no context or hallucinated an arrival date.
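The order example above can be sketched in a few lines. This is an illustration under stated assumptions, not a real integration: `lookup_order`, `lookup_carrier`, and `lookup_delay_policy` are hypothetical stand-ins that return canned data, and the planning step is a set of hand-written rules standing in for the model's decision.

```python
from typing import List

# Hypothetical stand-ins for the order API, carrier API, and knowledge base.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delayed", "eta": None}

def lookup_carrier(order_id: str) -> dict:
    return {"order_id": order_id, "eta": "3 business days"}

def lookup_delay_policy() -> str:
    return "Apologize, share the updated ETA, and offer a follow-up."

def answer_order_question(order_id: str, max_steps: int = 5) -> dict:
    facts: dict = {}
    trace: List[str] = []
    for _ in range(max_steps):
        # Plan: choose the next action from what is still missing.
        if "status" not in facts:
            action = "order_api"
        elif facts["status"] == "delayed" and facts.get("eta") is None:
            action = "carrier_api"  # Revise: the first answer was incomplete
        elif facts["status"] == "delayed" and "policy" not in facts:
            action = "knowledge_base"
        else:
            break  # Enough information to respond
        trace.append(action)
        # Act and observe: run the tool and store what it returned.
        if action == "order_api":
            facts.update(lookup_order(order_id))
        elif action == "carrier_api":
            facts["eta"] = lookup_carrier(order_id)["eta"]
        else:
            facts["policy"] = lookup_delay_policy()
    return {"facts": facts, "trace": trace}

result = answer_order_question("4821")
print(result["trace"])
```

The second and third calls only happen because the loop inspected the first result and found it incomplete. A fixed script would need those branches decided in advance.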

A common mistake

Many tutorials call a chain of prompts a “reasoning loop.” That is misleading. A real loop reacts to results and can change course. If nothing is observed and revised, if the system just executes a fixed sequence of prompts regardless of output, it is closer to a scripted workflow than a robust agent.

Layer 2, tool use: giving the agent access to the real world

An agent that cannot look anything up or take any action is limited to whatever the model learned during training. That means no current prices, no live order statuses, no real calculations, no ability to create a ticket or send an update.

“Let’s talk about one of the most exciting things you can do with large language models, giving them superpowers.” The framing of “superpowers” is practical, not hype. The model becomes useful because it can do more than autocomplete text. It can gather evidence, verify facts, and take actions in external systems.

Building AI agents requires tool access because:

  • Models are limited by their training data cutoff
  • Models can hallucinate facts, numbers, and references
  • Real work depends on external systems like databases, APIs, and business tools

There is a meaningful difference between generating an answer and gathering evidence. Tools move the agent from guessing to grounding.

Tool examples for a minimal Python agent

| Tool type | What it does | Example Python use | Why it matters for building AI agents | Tradeoff or risk |
|---|---|---|---|---|
| Web search API | Fetches current information | Search recent docs, prices, or headlines | Lets the agent work with live data | Search quality varies by provider |
| Database query | Retrieves structured records | Look up orders, users, or inventory | Grounds answers in system data | Needs access control and query limits |
| Calculator/function | Performs reliable computation | Tax, totals, date math | Better than asking the model to estimate | Narrow capability per function |
| Internal API | Takes real actions | Create a ticket or send an update | Moves from advice to execution | Requires safeguards and permissions |
| Code execution tool | Runs snippets or transforms data | Parse CSV, summarize logs | Useful for data-heavy tasks | Sandbox and security concerns |

When tool use changes the design

Once tools are involved, the system is no longer “just prompting.” Tools introduce software engineering concerns:

  • Structured inputs: Tools need well-defined parameters, not freeform text
  • Output validation: Tool responses must be checked before the agent acts on them
  • Retry logic: APIs fail. The agent needs to handle timeouts and errors
  • Access controls: Not every tool should be available to every agent or every user

This is the point where building AI agents starts to feel less like prompt engineering and more like systems engineering. That shift is intentional.
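Two of those concerns, input and output validation plus retries, fit in one small wrapper. The sketch below is one possible shape, not a standard API: `ToolError`, `call_with_retry`, and the default attempt count and delay are all assumptions made for illustration.

```python
import time
from typing import Callable

class ToolError(Exception):
    """Raised when input is invalid or the tool fails on every attempt."""

def call_with_retry(tool: Callable[[str], str], tool_input: str,
                    max_attempts: int = 3, base_delay: float = 0.1) -> str:
    # Structured input: reject anything that is not a non-empty string.
    if not isinstance(tool_input, str) or not tool_input.strip():
        raise ToolError("tool_input must be a non-empty string")
    last_error = None
    for attempt in range(max_attempts):
        try:
            output = tool(tool_input)
            # Output validation: check the result before the agent acts on it.
            if not isinstance(output, str) or not output:
                raise ValueError("tool returned empty or non-string output")
            return output
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, then 2x, then 4x, ...
                time.sleep(base_delay * (2 ** attempt))
    raise ToolError(f"tool failed after {max_attempts} attempts: {last_error}")
```

A real system would likely retry only on transient errors such as timeouts and fail fast on everything else. Access controls are left out here and would sit in front of this call.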

Layer 3, state management: why LLMs are stateless and agents cannot be

State and memory sound similar. They are not the same thing, and confusing them is one of the most common sticking points in building AI agents.

“LLMs are stateless, but agents need state to manage execution.” Each call to a language model is independent. The model does not remember what it said last time unless you explicitly pass prior context into the prompt. An agent, by contrast, needs to know where it is in a multi-step process.

Agents need state to track:

  • The current task and user query
  • Which steps have been completed
  • Tool outputs collected so far
  • Pending actions
  • Stop conditions (max steps, success criteria, error thresholds)

State vs memory

These terms get used interchangeably, but they describe different things:

  • State = execution status for the current run. Example: “waiting for search result”
  • Memory = stored information that may persist across runs. Example: “user prefers concise summaries”

State is about the current task. Memory is about accumulated knowledge. Most agents need state. Not all agents need memory.
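The distinction is easier to see in code. In this sketch, `RunState` and `Memory` are hypothetical names, and a JSON file in a temporary directory stands in for whatever persistent store a real agent would use.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class RunState:
    """Execution status for one run. Discarded when the run ends."""
    task: str
    status: str = "waiting for search result"
    completed_steps: list = field(default_factory=list)

class Memory:
    """Information that persists across runs, stored here as a JSON file."""
    def __init__(self, path: Path):
        self.path = path

    def load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def remember(self, key: str, value: str) -> None:
        data = self.load()
        data[key] = value
        self.path.write_text(json.dumps(data))

with tempfile.TemporaryDirectory() as tmp:
    memory = Memory(Path(tmp) / "memory.json")

    # Run 1: state tracks progress; one preference is written to memory.
    first_run = RunState(task="summarize report A")
    first_run.completed_steps.append("search")
    memory.remember("summary_style", "concise")

    # Run 2: state starts empty again, but memory carries over.
    second_run = RunState(task="summarize report B")
    recalled_style = memory.load()["summary_style"]
```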

Simple state machine for a minimal agent

A state machine defines where the agent can be, what events move it forward, and when it stops. Here is a minimal set of states:

IDLE → PLANNING → CALLING_TOOL → EVALUATING_RESULT → RESPONDING → DONE
                       ↓                  ↓
                     ERROR              ERROR

  • IDLE: Waiting for a task
  • PLANNING: Deciding the next action
  • CALLING_TOOL: Executing a tool function
  • EVALUATING_RESULT: Inspecting the tool’s output
  • RESPONDING: Generating the final answer
  • DONE: Task complete
  • ERROR: Timeout, invalid tool output, or max steps reached

What moves the agent between states? Events like “tool returned a result,” “result is insufficient, re-plan,” or “max steps exceeded, stop.” Timeouts, invalid outputs, and step limits push the agent to ERROR or DONE to prevent infinite loops.
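One way to make those transitions explicit is a lookup table keyed by state and event. The state names below come from the list above. The event names, such as `tool_returned` and `result_insufficient`, are assumptions chosen for this sketch.

```python
from enum import Enum

class Status(Enum):
    IDLE = "IDLE"
    PLANNING = "PLANNING"
    CALLING_TOOL = "CALLING_TOOL"
    EVALUATING_RESULT = "EVALUATING_RESULT"
    RESPONDING = "RESPONDING"
    DONE = "DONE"
    ERROR = "ERROR"

# (current state, event) -> next state. Any pair not listed is illegal.
TRANSITIONS = {
    (Status.IDLE, "task_received"): Status.PLANNING,
    (Status.PLANNING, "tool_selected"): Status.CALLING_TOOL,
    (Status.CALLING_TOOL, "tool_returned"): Status.EVALUATING_RESULT,
    (Status.CALLING_TOOL, "timeout"): Status.ERROR,
    (Status.EVALUATING_RESULT, "result_insufficient"): Status.PLANNING,
    (Status.EVALUATING_RESULT, "result_sufficient"): Status.RESPONDING,
    (Status.EVALUATING_RESULT, "invalid_output"): Status.ERROR,
    (Status.RESPONDING, "answer_sent"): Status.DONE,
}

def next_status(current: Status, event: str) -> Status:
    """Return the next state, or raise if the transition is not allowed."""
    # The step limit applies from any state that is not already terminal.
    if event == "max_steps_exceeded" and current not in (Status.DONE, Status.ERROR):
        return Status.ERROR
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {current.value} on {event!r}") from None
```

Because every legal move is listed in one place, an unexpected event raises an error instead of silently putting the agent in an undefined state.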

Putting it together: a minimal working agent in Python

This section combines all three layers into one small but functional example. The goal is a research assistant agent that answers a question using a web search tool and a calculator.

Step 1: define the task and the agent’s boundaries

Strong agents start with narrow scope. This agent can:

  • Search the web for information
  • Perform basic calculations
  • Synthesize results into a final answer

It cannot send emails, modify databases, or take any other real-world action. It gets a maximum of five reasoning steps, after which it must return whatever it has gathered.

Step 2: define the tools

Tools are Python functions with clear input/output contracts.

def search_web(query: str) -> str:
    """Search the web and return a summary of results."""
    # In production, this wraps a real search API with auth and error handling
    # Simulated for this example
    if "population" in query.lower():
        return "According to recent data, Tokyo has approximately 14 million residents in the city proper."
    return f"No strong results found for: {query}"

def calculate(expression: str) -> str:
    """Evaluate a math expression and return the result."""
    try:
        result = eval(expression)  # Use a safe parser in production
        return str(result)
    except Exception as e:
        return f"Calculation error: {e}"

In production, these would wrap real APIs and need authentication, error handling, rate limiting, and logging. The structure matters more than the implementation details here.
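The `eval` call in `calculate` is the one piece worth replacing early, because the expression comes from model output. Below is a minimal sketch of a safer evaluator built on Python's standard `ast` module. `safe_calculate` is a hypothetical name, and the choice to support only basic arithmetic (no exponentiation, names, or function calls) is an assumption made to keep the sketch small.

```python
import ast
import operator

# Only these operators are allowed; everything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def _evaluate(node):
    """Recursively evaluate a parsed expression made of numbers and operators."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")

def safe_calculate(expression: str) -> str:
    """Evaluate basic arithmetic without executing arbitrary code."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_evaluate(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError) as exc:
        return f"Calculation error: {exc}"

print(safe_calculate("2 + 3 * 4"))          # 14
print(safe_calculate("__import__('os')"))   # Calculation error: unsupported expression
```

It keeps the same string-in, string-out contract as `calculate`, so it could be registered as a tool in the same way.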

Step 3: create a simple state object

This is the piece many beginner tutorials leave out. Without explicit state, the agent has no way to track its own progress.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentState:
    user_query: str
    current_step: int = 0
    max_steps: int = 5
    observations: list = field(default_factory=list)
    selected_tool: Optional[str] = None
    final_answer: Optional[str] = None
    status: str = "PLANNING"  # This minimal loop uses PLANNING, CALLING_TOOL, DONE, ERROR

Step 4: implement the reasoning loop

The loop decides whether to call a tool, continue reasoning, or stop. It uses a lightweight decision schema from the model.

import json

TOOLS = {"search_web": search_web, "calculate": calculate}

def get_model_decision(state: AgentState) -> dict:
    """Ask the LLM what to do next. Returns a structured decision."""
    # In production, this calls your LLM API with the current state
    # The prompt includes the user query, observations so far, and available tools
    # Expected response format:
    # {"action": "call_tool" | "final_answer",
    #  "tool_name": "search_web" | "calculate",
    #  "tool_input": "...",
    #  "reasoning": "..."}
    
    # Simulated decision logic for this example
    if not state.observations:
        return {
            "action": "call_tool",
            "tool_name": "search_web",
            "tool_input": state.user_query,
            "reasoning": "No information gathered yet. Searching for relevant data."
        }
    return {
        "action": "final_answer",
        "tool_name": None,
        "tool_input": None,
        "reasoning": "Sufficient information gathered to answer."
    }
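When the simulated logic is swapped for a real model call, the response arrives as text and has to be checked before the loop trusts it. Here is a hedged sketch of that check against the decision format described in the comments above. `parse_decision`, `ALLOWED_ACTIONS`, and `ALLOWED_TOOLS` are hypothetical names, and no particular LLM API is assumed.

```python
import json

ALLOWED_ACTIONS = {"call_tool", "final_answer"}
ALLOWED_TOOLS = {"search_web", "calculate"}

def parse_decision(raw: str) -> dict:
    """Parse and validate the model's decision. Raises ValueError if malformed."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(decision, dict):
        raise ValueError("decision must be a JSON object")

    action = decision.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    if action == "call_tool":
        tool_name = decision.get("tool_name")
        if not isinstance(tool_name, str) or tool_name not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {tool_name!r}")
        tool_input = decision.get("tool_input")
        if not isinstance(tool_input, str) or not tool_input:
            raise ValueError("tool_input must be a non-empty string")
    return decision

raw_output = '{"action": "call_tool", "tool_name": "search_web", "tool_input": "Tokyo population", "reasoning": "Need data."}'
print(parse_decision(raw_output)["tool_name"])  # search_web
```

A loop could treat the `ValueError` as an observation and ask the model to try again, counting the retry against the step limit.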

Step 5: handle tool results and update state

Tool calls do not always succeed. The agent needs to handle failures gracefully and update state regardless of outcome.

def execute_tool(state: AgentState, decision: dict) -> str:
    """Run the selected tool and return its output."""
    tool_name = decision["tool_name"]
    tool_input = decision["tool_input"]
    
    if tool_name not in TOOLS:
        return f"Error: Unknown tool '{tool_name}'"
    
    try:
        result = TOOLS[tool_name](tool_input)
        return result
    except Exception as e:
        return f"Tool execution failed: {e}"

Step 6: return a final answer

The finish point is explicit. The agent generates a final response grounded in its observations, not from the model’s internal knowledge alone.

def generate_final_answer(state: AgentState) -> str:
    """Produce the final answer based on collected observations."""
    # In production, pass observations to the LLM for synthesis
    obs_text = "\n".join(state.observations)
    return f"Based on research: {obs_text}"

Minimal Python example, full flow

def run_agent(user_query: str) -> str:
    """Run the agent loop: plan, act, observe, revise."""
    state = AgentState(user_query=user_query)
    
    while state.status not in ("DONE", "ERROR"):
        # Guard against infinite loops
        if state.current_step >= state.max_steps:
            state.status = "ERROR"
            return "Max steps reached. Partial observations: " + str(state.observations)
        
        # PLAN: ask the model what to do
        state.status = "PLANNING"
        decision = get_model_decision(state)
        
        if decision["action"] == "final_answer":
            # RESPOND: generate and return the answer
            state.status = "DONE"
            state.final_answer = generate_final_answer(state)
            return state.final_answer
        
        # ACT: call the selected tool
        state.status = "CALLING_TOOL"
        state.selected_tool = decision["tool_name"]
        result = execute_tool(state, decision)
        
        # OBSERVE: record the result
        state.observations.append(f"[{decision['tool_name']}] {result}")
        state.current_step += 1
        
        # REVISE: loop back to planning with updated state
        state.status = "PLANNING"
    
    return state.final_answer or "Agent completed without producing an answer."

# Example invocation
answer = run_agent("What is the population of Tokyo?")
print(answer)

This code is not production-grade. It is designed to make the architecture visible. Every component maps to one of the three layers: the while loop is the reasoning loop, TOOLS is the tool layer, and AgentState is the state management layer.

What this minimal agent still lacks

Honest accounting of what would need to change for production use:

  • Authentication and secrets management for real API keys
  • Schema validation on tool inputs and outputs
  • Monitoring and tracing to debug failures across steps
  • Retries with backoff for flaky tool calls
  • Human approval gates for high-impact actions
  • Evaluation benchmarks to measure agent accuracy over time
  • Cost monitoring to track LLM API spend per run

Most of these are software engineering concerns rather than model concerns. That distinction matters for career planning.

Design choices that matter more than model choice

Most failures in early agent systems come from architecture problems, not model limitations:

  • Poor tool design with ambiguous inputs or outputs
  • Missing state logic that lets the agent lose track of progress
  • Unclear stop conditions that cause infinite loops or premature exits
  • Weak error handling that silently swallows tool failures

A few architecture tradeoffs worth considering:

  • One general tool vs many narrow tools: Narrow tools are easier to test and validate. General tools are more flexible but harder to make reliable.
  • Short loop with tight constraints vs open-ended exploration: Tight constraints are easier to debug and cheaper to run. Open-ended loops handle more complex tasks but need stronger guardrails.
  • Explicit state machine vs ad hoc state in prompt text: State machines are more predictable. Prompt-based state is more flexible but harder to inspect and test.

The direct recommendation: start with the simplest architecture that can complete the task reliably. Add complexity only when you have evidence that the simple version fails.

When to go multi-agent, and when not to

“Think of a multi-agent system like a restaurant: hosts greet customers, waiters take orders, chefs cook, and managers oversee operations.” Each agent has a defined role. Coordination happens through clear handoffs, not one system trying to do everything.

Multi-agent patterns help when tasks naturally decompose into distinct roles:

  • One agent retrieves and gathers data
  • One evaluates quality or checks for errors
  • One drafts the final output
  • One routes, supervises, or decides escalation

But multi-agent is not the default answer. It adds orchestration complexity, more failure points, and harder evaluation.
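To make the role split concrete, here is a deliberately tiny sketch in which each "agent" is a plain function and the supervisor owns the handoffs. The function names and the canned data are assumptions. In a real system each role would wrap its own model calls, tools, and state.

```python
from typing import List

def retriever(topic: str) -> List[str]:
    """Gathers raw notes. Simulated; one note is deliberately empty."""
    return [f"{topic}: spec sheet found", "", f"{topic}: pricing page found"]

def evaluator(notes: List[str]) -> List[str]:
    """Drops notes that fail a basic quality check (here, blank strings)."""
    return [note for note in notes if note.strip()]

def drafter(notes: List[str]) -> str:
    """Turns the checked notes into the final output."""
    return "Summary: " + "; ".join(notes)

def supervisor(topic: str) -> str:
    """Routes work between roles and decides when to escalate."""
    notes = evaluator(retriever(topic))
    if not notes:
        return f"ESCALATE: no usable notes for {topic}"
    return drafter(notes)

print(supervisor("Widget X"))
```

Even at this size, there are three handoffs that can each fail, which is the orchestration cost the paragraph above describes.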

Single agent vs multi-agent comparison

| Approach | Best for | Strengths | Tradeoffs | Recommendation |
|---|---|---|---|---|
| Single agent | Narrow workflows, early prototypes, well-bounded tasks | Simpler design, easier debugging, lower orchestration overhead | Can become overloaded on complex workflows | Start here by default |
| Multi-agent system | Complex workflows with distinct roles or parallel work | Specialization, modularity, clearer role separation | More orchestration, more failure points, harder evaluation | Use only when role separation clearly improves outcomes |

A good rule of thumb

If one agent with good tools and clear state can complete the job, keep it single-agent. Add multiple agents only when responsibilities are clearly separable, specialization measurably improves quality, and the orchestration cost is worth the outcome.

Real-world deployment lessons: what breaks first

Early failures in AI agent deployments usually come from systems issues, not model intelligence. The model can reason well enough. The surrounding system falls apart.

Common failure patterns:

  • Tool failures: APIs time out, return unexpected formats, or hit rate limits
  • Bad permissions: The agent tries to access a resource it is not authorized to reach
  • Infinite loops: Missing stop conditions or poorly defined success criteria
  • Weak monitoring: No visibility into what the agent did or why it made a decision
  • Ambiguous task definitions: The agent cannot determine when it has answered the question

Moving from experiment to production means treating agents like software systems, not just prompts. Teams need logging and traces, retry strategies, guardrails on outputs, human fallback paths for unclear cases, and cost monitoring per run.
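Logging and traces can start very small. The sketch below records one structured event per decision or tool result and serializes them as JSON lines. `RunTrace` and its field names are assumptions for illustration, not part of any tracing library.

```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class RunTrace:
    """Collects structured events for one agent run."""
    run_id: str
    events: list = field(default_factory=list)

    def record(self, step: int, kind: str, detail: str) -> None:
        self.events.append({
            "run_id": self.run_id,
            "step": step,
            "kind": kind,          # e.g. "decision" or "tool_result"
            "detail": detail,
            "timestamp": time.time(),
        })

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to append to a log file."""
        return "\n".join(json.dumps(event) for event in self.events)

trace = RunTrace(run_id="demo-run")
trace.record(0, "decision", "call_tool search_web")
trace.record(0, "tool_result", "No strong results found")
trace.record(1, "decision", "final_answer")
print(trace.to_jsonl())
```

With a trace like this, "why did the agent stop after one search?" becomes a question you can answer from the log instead of by rerunning the agent.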

Deployment pattern examples

Support triage agent: Routes incoming tickets to the correct team based on content analysis. Uses a classification tool and a knowledge base lookup. Tracks ticket state through RECEIVED → CLASSIFIED → ROUTED → CONFIRMED. Guardrails: escalates to a human when confidence is below threshold. Logs every routing decision for audit.
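A rough sketch of the triage pattern, under stated assumptions: `classify_ticket` is a hypothetical keyword classifier standing in for a real model, the 0.7 threshold is arbitrary, and the CONFIRMED step is left out because it depends on the receiving team acknowledging the ticket.

```python
from dataclasses import dataclass, field
from typing import Tuple

CONFIDENCE_THRESHOLD = 0.7  # assumed value; tune against real routing data

def classify_ticket(text: str) -> Tuple[str, float]:
    """Hypothetical classifier: returns (team, confidence)."""
    lowered = text.lower()
    if "refund" in lowered or "invoice" in lowered:
        return "billing", 0.9
    if "error" in lowered or "crash" in lowered:
        return "technical", 0.85
    return "general", 0.4

@dataclass
class Ticket:
    text: str
    status: str = "RECEIVED"
    team: str = ""
    audit_log: list = field(default_factory=list)

def triage(ticket: Ticket) -> Ticket:
    team, confidence = classify_ticket(ticket.text)
    ticket.status = "CLASSIFIED"
    ticket.audit_log.append(f"classified as {team} (confidence {confidence})")

    # Guardrail: low-confidence classifications go to a human.
    ticket.team = team if confidence >= CONFIDENCE_THRESHOLD else "human_review"
    ticket.status = "ROUTED"
    ticket.audit_log.append(f"routed to {ticket.team}")
    return ticket

routed = triage(Ticket(text="I need a refund for my last invoice"))
escalated = triage(Ticket(text="Hello, quick question"))
print(routed.team, escalated.team)
```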

Competitor research agent: Gathers product information from public sources, compares specs, and drafts a summary. Uses web search and a structured data parser. Tracks which competitors have been researched and which comparisons are complete. Guardrails: flags any claim that cannot be traced to a source URL. Requires human review before the summary is shared externally.

These are the kinds of skills that matter in the AI job landscape. Not just prompting a model, but designing systems that handle real operational requirements.

Future trends in building AI agents

Progress in agentic AI is likely to come from better orchestration, observability, and evaluation rather than just larger models. Several directions are gaining traction:

  • Better tool routing: Models are improving at selecting the right tool and structuring inputs correctly
  • Stronger structured outputs: More reliable JSON and schema-conformant responses reduce parsing errors
  • More reliable evaluation: New benchmarks and frameworks for measuring agent performance on multi-step tasks

The professionals who can move systems from experiment to production will be valuable to employers looking to operationalize AI at scale. That means understanding not just how to call a model, but how to design the system around it.

Conclusion

If you want to understand how to build an AI agent, do not start with hype or framework comparisons. Start with the three layers: a reasoning loop that can plan and revise, tools that connect the agent to real data and actions, and state management that tracks execution from start to finish.

Even a minimal Python agent built on these principles teaches the right architecture. The code in this article is deliberately simple. The patterns it demonstrates are the same ones that scale into production systems.

Building AI agents with Python is a skill that combines software engineering, system design, and applied AI. It is not about knowing one framework or one model. It is about understanding how to make them work together reliably.
