Two aspects determine an agent's capabilities: tools and planning.
Artificial Intelligence: A Modern Approach (1995) defines an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. This means that the characteristics of an agent depend on the environment it is in and the range of actions it can perform. If an agent is developed to play games (e.g., Minecraft, Go, Dota), then the game is its environment. If you want the agent to scrape documents from the internet, then the internet is its environment. The environment for an autonomous vehicle agent is the road system and its surrounding areas.
The range of actions an AI agent can perform is enhanced by the tools it has access to. ChatGPT is an agent that can search the web, execute Python code, and generate images. RAG systems are also agents—text retrievers, image retrievers, and SQL executors are their tools.
There is a strong dependency between an agent's environment and its toolset. The environment determines which tools the agent can use: if the environment is a game of chess, the only possible actions are valid chess moves. Conversely, the agent's toolset constrains the environments it can operate in: if a robot's only action is swimming, it will be restricted to aquatic environments.
Compared to non-agent use cases, agents typically require more powerful models for two reasons:
- Compounding errors: Agents often need to perform multiple steps to complete a task, and as the number of steps increases, overall accuracy declines. If the model is 95% accurate at each step, after 10 steps the overall accuracy drops to about 60%, and after 100 steps to roughly 0.6% (see the quick calculation after this list).
- Higher risk: With tools, agents can perform more impactful tasks, but any failure can have more severe consequences.
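To make the compounding concrete, here is the arithmetic behind those numbers, assuming each step succeeds independently with the same probability:

```python
per_step_accuracy = 0.95

# Probability that every step in the chain succeeds.
print(per_step_accuracy ** 10)    # ~0.599  -> roughly 60% after 10 steps
print(per_step_accuracy ** 100)   # ~0.0059 -> roughly 0.6% after 100 steps
```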
Whenever I talk to a group about autonomous AI agents, someone always brings up self-driving cars. “What if someone hacks the car and kidnaps you?” While the example of self-driving cars seems intuitive due to their physical nature, AI systems can cause harm without existing in the physical world.
Any organization looking to leverage artificial intelligence needs to take safety and security seriously. However, this does not mean that AI systems should never be given the ability to act in the real world. If we can trust machines to take us into space, I hope that one day, safety measures will be sufficient for us to trust autonomous AI systems. Moreover, humans can also make mistakes. Personally, I trust self-driving cars more than I would trust a stranger to give me a ride.
Planning and execution can be combined in the same prompt. For example, give the model a prompt asking it to think step-by-step (such as using chain-of-thought prompting) and then execute those steps in one prompt. But what if the model devises a 1000-step plan and fails to achieve the goal? Without supervision, the agent might run these steps for hours, wasting time and API call costs, until you realize it is making no progress.
To avoid ineffective execution, planning should be decoupled from execution: ask the agent to first generate a plan, and only execute the plan after it has been validated. Plans can be validated with heuristics. For example, one simple heuristic is to exclude plans that contain invalid actions: if the generated plan requires a Google search but the agent has no access to Google search, the plan is invalid. Another simple heuristic is to exclude any plan with more than X steps.
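As a minimal sketch of such heuristic validation, assuming a plan is just a list of action names and we know which actions the agent actually has:

```python
def validate_plan(plan, available_actions, max_steps=10):
    """Reject plans that use unknown actions or are suspiciously long."""
    if len(plan) > max_steps:
        return False, f"plan has {len(plan)} steps, limit is {max_steps}"
    for action in plan:
        if action not in available_actions:
            return False, f"unknown action: {action}"
    return True, "ok"

# A plan requiring google_search is rejected because the agent cannot access it.
ok, reason = validate_plan(
    ["google_search", "generate_response"],
    available_actions={"fetch_product_info", "generate_query", "generate_response"},
)
```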
The planning system now consists of three components: one for generating plans, one for validating plans, and another for executing plans. If each component is viewed as an agent, this can be seen as a multi-agent system. Since most agent workflows are complex enough to involve multiple components, most agents are multi-agent.
Solving a task typically involves the following process (a minimal sketch of the full loop follows this list). Note that reflection is not a strict requirement for an agent, but it significantly improves the agent's performance.
- Plan generation: Develop a plan to complete the task. A plan is a series of manageable actions, so this process is also referred to as task decomposition.
- Reflection and error correction: Evaluate the generated plan. If the plan is poor, generate a new plan.
- Execution: Take action according to the generated plan. This usually involves calling specific functions.
- Reflection and error correction: After receiving the results of the actions, evaluate those results and determine whether the goal has been achieved. Identify and correct errors. If the goal is not completed, devise a new plan.
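Put together, the loop might look like the sketch below, where generate_plan, validate_plan, execute_step, and reflect are hypothetical stand-ins for whatever models or heuristics you use, not the API of any particular framework:

```python
def solve(task, max_attempts=3):
    for _ in range(max_attempts):
        plan = generate_plan(task)               # plan generation (task decomposition)
        ok, feedback = validate_plan(plan)       # reflect on the plan before acting
        if not ok:
            task = f"{task}\nPrevious plan was rejected: {feedback}"
            continue
        results = [execute_step(step) for step in plan]   # execution (function calls)
        done, feedback = reflect(task, results)           # reflect on the outcome
        if done:
            return results
        task = f"{task}\nPrevious attempt failed: {feedback}"
    raise RuntimeError("goal not reached within the attempt budget")
```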
The simplest way to transform a model into a plan generator is through prompt engineering. Suppose we create an agent to help customers understand Kitty Vogue's products. This agent has access to three external tools: retrieving products by price, retrieving popular products, and retrieving product information. Here is an example of a prompt for plan generation. This prompt is for illustrative purposes only; prompts in actual production may be more complex.
Propose a plan to solve the task. You have access to 5 actions:
* get_today_date()
* fetch_top_products(start_date, end_date, num_products)
* fetch_product_info(product_name)
* generate_query(task_history, tool_output)
* generate_response(query)
The plan must be a sequence of valid actions.
Examples
Task: "Tell me about Fruity Fedora"
Plan: [fetch_product_info, generate_query, generate_response]
Task: "What was the best selling product last week?"
Plan: [fetch_top_products, generate_query, generate_response]
Task: {USER INPUT}
Plan:
Many model providers offer tool use for their models, effectively turning their models into agents, where tools are functions. Tool use is therefore often referred to as function calling. In general, function calling works as follows:
- Create a tool list. Declare all the tools the model might want to use. Each tool is described by its execution entry point (e.g., its function name), its parameters, and its documentation (e.g., what the function does and what parameters it needs). See the schema sketch after this list.
- Specify which tools the agent can use for a given query. Because different queries need different tools, many APIs let you pass, for each query, the subset of declared tools it may use.
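For example, with an OpenAI-style function-calling API, a tool is typically declared as a JSON-like schema. The exact field names vary by provider, so treat this as an illustrative sketch rather than any specific SDK's contract:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "lbs_to_kg",
            "description": "Convert a weight from pounds to kilograms.",
            "parameters": {
                "type": "object",
                "properties": {
                    "lbs": {"type": "number", "description": "Weight in pounds."},
                },
                "required": ["lbs"],
            },
        },
    },
]
# The tool list (or a subset of it) is then passed to the model along with each query.
```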
For example, given the user query "How many kilograms is 40 pounds?", the agent might decide that it needs the tool lbs_to_kg_tool and call it with the parameter value 40. The agent's response might look like this:
```python
response = ModelResponse(
    finish_reason='tool_calls',
    message=chat.Message(
        content=None,
        role='assistant',
        tool_calls=[
            ToolCall(
                function=Function(
                    arguments='{"lbs": 40}',
                    name='lbs_to_kg'),
                type='function')
        ])
)
```
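When the model returns a response like this, your application code is responsible for actually executing the call. A minimal dispatch sketch, assuming the response object mirrors the structure above and that you maintain a registry mapping tool names to Python functions (lbs_to_kg here is a stand-in implementation):

```python
import json

def lbs_to_kg(lbs):
    return lbs * 0.45359237

TOOL_REGISTRY = {"lbs_to_kg": lbs_to_kg}

tool_call = response.message.tool_calls[0]
func = TOOL_REGISTRY[tool_call.function.name]       # look up the implementation
kwargs = json.loads(tool_call.function.arguments)   # arguments arrive as a JSON string
result = func(**kwargs)                             # 40 lbs -> ~18.14 kg
# The result is then returned to the model so it can generate the final answer.
```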
A plan is a roadmap outlining the steps needed to complete a task. The roadmap can have different levels of granularity. When planning for a year, quarterly plans are higher level than monthly plans, and monthly plans are higher level than weekly plans.
Detailed plans are harder to create but easier to execute, whereas higher-level plans are easier to create but harder to execute. One way around this trade-off is hierarchical planning: first use the planner to generate high-level plans, such as quarterly plans; then, for each quarter, use the same or a different planner to refine them into monthly plans.
Plans can also be expressed in natural language rather than as exact function calls, for example "check the product's information" instead of fetch_product_info. Using more natural language helps the plan generator stay robust to changes in the tool APIs. If the model was trained primarily on natural language, it is likely to be better at understanding and generating natural language plans and less likely to hallucinate.
The downside of this approach is that it requires a translator to convert each natural language action into executable commands. Chameleon (Lu et al., 2023) refers to this translator as a program generator. However, translation is much simpler than planning and can be accomplished by weaker models, with a lower risk of hallucination.
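A translator can be as simple as a lookup table or another (weaker) model call that maps each natural-language step onto one of the agent's executable actions. The mapping below is a hypothetical sketch, not Chameleon's actual program generator:

```python
# Hypothetical aliases from natural-language plan steps to the executable actions
# defined earlier; a real translator would more likely be another model call.
ACTION_ALIASES = {
    "get today's date": "get_today_date",
    "check the product's information": "fetch_product_info",
    "find the best-selling products": "fetch_top_products",
}

def translate(natural_language_plan):
    return [ACTION_ALIASES[step.strip().lower()] for step in natural_language_plan]

translate(["Get today's date", "Find the best-selling products"])
# -> ['get_today_date', 'fetch_top_products']
```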
Control flow includes sequences, parallelism, if statements, and for loops. When evaluating agent frameworks, check the control flows they support. For example, if the system needs to browse ten websites, can it do so simultaneously? Parallel execution can significantly reduce the perceived latency for users.
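For instance, fetching ten pages concurrently rather than one after another is straightforward with Python's standard library; a sketch with placeholder URLs:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = [f"https://example.com/page{i}" for i in range(10)]  # placeholder URLs

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

# Fetch all pages in parallel instead of sequentially.
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, urls))
```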
Even the best plans require constant evaluation and adjustment to maximize the chances of success. While reflection is not a strict necessity for agent operation, it is essential for the agent's success.
Reflection and error correction are two complementary mechanisms. Reflection generates insights that help identify errors that need correction. Reflection can be accomplished through self-critique prompts used by the same agent. It can also be done through a separate component, such as a dedicated scorer: a model that outputs specific scores for each result.
ReAct (Yao et al., 2022) proposed interleaving reasoning with action, which has become a common pattern for agents. Yao et al. use the term "reasoning" to encompass both planning and reflection. At each step, the agent is asked to explain its thinking (planning), take an action, and then analyze the observation (reflection), repeating until the agent decides the task is complete.
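A bare-bones version of this interleaving might look like the sketch below, where call_model and run_action are hypothetical stand-ins for your model client and tool executor; the actual ReAct prompts are considerably more elaborate:

```python
def react_loop(task, max_turns=10):
    history = f"Task: {task}\n"
    for _ in range(max_turns):
        thought = call_model(history + "Thought:")                     # reasoning (planning)
        action = call_model(history + f"Thought: {thought}\nAction:")
        if action.strip().lower().startswith("finish"):                # model decides it is done
            return call_model(history + f"Thought: {thought}\nFinal answer:")
        observation = run_action(action)                               # act with a tool
        history += (f"Thought: {thought}\nAction: {action}\n"
                    f"Observation: {observation}\n")                   # reflect on the result next turn
    return history
```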
Reflection mechanisms can be implemented in multi-agent environments: one agent is responsible for planning and executing actions, while another agent evaluates the results after each step or several steps. If the agent's response fails to complete the task, you can prompt the agent to reflect on the reasons for the failure and how to improve. Based on this suggestion, the agent generates a new plan. This allows the agent to learn from its mistakes.
In the Reflexion framework (Shinn et al., 2023), reflection is split into two modules: an evaluator that assesses the outcome and a self-reflection module that analyzes what went wrong. The term "trajectory" is used to refer to a plan. After each step, following evaluation and self-reflection, the agent proposes a new trajectory.
Reflection is relatively easy to implement compared to plan generation and can yield surprising performance improvements. The downside of this approach is latency and cost. Thinking, observing, and sometimes even acting may require generating a large number of tokens, increasing costs and perceived latency for users, especially for tasks with many intermediate steps. To encourage their agents to follow formats, the authors of ReAct and Reflexion used numerous examples in their prompts. This increases the cost of computing input tokens and reduces the contextual space available for other information.
More tools will give agents more capabilities. However, the more tools there are, the harder it becomes to use them efficiently. This is similar to the increasing difficulty for humans to master a large number of tools. Adding tools also means increasing tool descriptions, which may not fit within the model's context.
Like many other decisions when building AI applications, tool selection requires experimentation and analysis. Here are some considerations that can help you make decisions:
- Compare agent performance under different tool sets;
- Conduct ablation studies to see how much agent performance declines if a certain tool is removed. If a tool can be removed without degrading performance, then remove it;
- Look for tools that the agent frequently struggles with. If a tool is too difficult for the agent to use, for example, if extensive prompting or fine-tuning cannot teach the model to use it, then consider replacing the tool;
- Plot a distribution of tool calls to see which tools are used most and least;
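The last point is easy to get from your agent's logs. A small sketch, assuming you record the name of every executed tool call:

```python
from collections import Counter

# tool_call_log would come from your agent traces, one entry per executed call.
tool_call_log = ["fetch_product_info", "generate_response", "fetch_product_info",
                 "fetch_top_products", "generate_response"]

usage = Counter(tool_call_log)
for tool, count in usage.most_common():
    print(f"{tool}: {count}")
```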
Different models and tasks exhibit different tool usage patterns. Different tasks require different tools. ScienceQA (science question answering task) relies more on knowledge retrieval tools than TabMWP (table math problem-solving task). Different models have different tool preferences. For example, GPT-4 seems to choose a broader tool set than ChatGPT. ChatGPT appears to lean more towards image descriptions, while GPT-4 leans more towards knowledge retrieval.
Planning is difficult and can fail in various ways. The most common mode of planning failure is tool usage failure. An agent may generate a plan that contains one or more of the following errors.
- Invalid tool. For example, it generates a plan that includes bing_search, but that tool is not in the tool list.
- Valid tool, invalid parameters. For example, it calls lbs_to_kg with two parameters, but the function requires only one parameter, lbs.
- Valid tool, incorrect parameter values. For example, it calls lbs_to_kg with the parameter lbs, but uses the value 100 to represent pounds when it should be 120.
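The first two failure modes can be caught automatically before execution by checking a proposed call against the real function signatures; the third generally requires a model or a human to catch. A sketch of such a check, using a stand-in lbs_to_kg implementation:

```python
import inspect

def lbs_to_kg(lbs):
    return lbs * 0.45359237

TOOLS = {"lbs_to_kg": lbs_to_kg}

def check_call(tool_name, kwargs):
    if tool_name not in TOOLS:                       # invalid tool
        return f"unknown tool: {tool_name}"
    expected = set(inspect.signature(TOOLS[tool_name]).parameters)
    if set(kwargs) != expected:                      # valid tool, invalid parameters
        return f"expected parameters {expected}, got {set(kwargs)}"
    return "ok"

check_call("bing_search", {"q": "fruity fedora"})    # -> unknown tool
check_call("lbs_to_kg", {"lbs": 40, "unit": "lb"})   # -> wrong parameters
```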
Another mode of planning failure is goal failure: the agent fails to achieve the goal. This may be because the plan does not solve the task, or although it solves the task, it does not adhere to constraints. For example, suppose you ask the model to plan a two-week trip from San Francisco to India with a budget of $5000. The agent might plan a trip from San Francisco to Vietnam or plan a two-week trip from San Francisco to India but exceed the budget significantly.
In agent evaluation, a commonly overlooked constraint is time. Often, how long the agent takes to complete a task matters little, since you can hand the task off and check back when it is done. In other cases, however, the agent's output becomes less useful the longer it takes. For example, if you ask the agent to prepare a grant proposal and it finishes after the grant deadline, its help is not very useful.
An agent may generate an effective plan using the correct tools to complete a task, but it may not be efficient. Here are several aspects you might want to track to evaluate agent efficiency:
- How many steps does the agent typically take to complete a task?
- What is the average cost of the agent completing a task?
- How long does each operation typically take? Are there particularly time-consuming or costly operations?
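These metrics are simple to compute if the agent logs each step. A minimal sketch, assuming each hypothetical log entry records the tool used, the seconds taken, and the dollars spent:

```python
# Hypothetical per-step log entries: (tool name, seconds taken, dollars spent).
steps = [
    ("fetch_top_products", 1.2, 0.002),
    ("fetch_product_info", 0.8, 0.001),
    ("generate_response", 3.5, 0.010),
]

num_steps = len(steps)
total_cost = sum(cost for _, _, cost in steps)
slowest = max(steps, key=lambda s: s[1])

print(f"steps: {num_steps}, total cost: ${total_cost:.3f}, "
      f"slowest: {slowest[0]} ({slowest[1]}s)")
```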
You can compare these metrics against your baseline, which could be another agent or a human. When comparing AI agents to humans, keep in mind that the operational modes of humans and AI are vastly different, so behaviors that are efficient for humans may be inefficient for AI, and vice versa. For example, visiting 100 web pages may be inefficient for a human who can only visit one page at a time, but it is a breeze for an AI agent that can access all pages simultaneously.
Original link: https://huyenchip.com/2025/01/07/agents.html