All About AI Agents

3 min read

February 2, 2025

We are going to see a lot more froth about AI agents in 2025. I expect a lot of money will continue to be spent pursuing AI agents but the short-term results will be disappointing to most of the people who are excited about this term.

What is an AI Agent?

The basic building block of agentic systems is an LLM enhanced with augmentations such as specialized content retrieval, tools, and memory. If we broaden this slightly we would add the concepts of LLM-driven control, autonomy, planning, and goal-orientation.

In the broadest sense of the term, an AI agent is a system that can take actions in the world to achieve goals and also learn from experience. This very broad definition includes everything from a simple reinforcement learning agent that learns to play a video game, to a complex system that can reason about the world and make decisions in a way that is similar to a human.

A self-driving car is an example of the latter. It takes actions in the world (driving the car) to achieve a goal (getting to a destination) and it learns from experience (by training on data from other cars and from human drivers). Despite spending $16 billion dollars on self-driving research by 2020 the technology is still not ready for mass adoption in 2025.

What are AI Models Good At Currently?

For starters, content creation. If I can manually prompt a LLM to generate marketing copy and I can prompt it to translate the marketing copy into the different languages of my prospects, then create personalized emails for each propect, I can link these tasks into a workflow. This is a very simple example, but it's a real-world task that can be done with current technology. However, in this example the AI model has very limited autonomy, lacks goal-orientation, and requires no planning.

What are AI Models Bad At Currently?

Learning from experience. AI models do not learn like humans do; they learn from training data. There is a cycle of improving/expanding training data, training, testing, and deploying that takes time.
Gullibility. Current AI models will believe anything you tell them, even if it's obviously false. I don't want my AI agent to click buy on "the best deal ever" on the first website it sees.
False confidence. AI models do not currently have a way to express uncertainty. They will provide requested output without sharing their confidence level.
Reasoning about reasoning. "Beware that o1’s mistakes include reasoning about how much it should reason. Sometimes the variance fails to accurately map to task difficulty. e.g. if the task is really simple, it will often spiral into reasoning rabbit holes for no reason."

Goal Orientation

Amanda Askell

@AmandaAskell

·Follow

The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.

7:44 PM · Dec 9, 2024

360

Read 13 replies

Test-driven development is where you write a test that describes the behavior you want, then you write the code that makes the test pass. This forces the organization to think about what the code needs to do before you even start writing it. However, it's very hard to find "in the real world" because developers spend a lot of time writing tests before writing any functional code; therefore there is nothing to show off to management for a long time. Also, most organizations have a very hard time specifying what they want he software to do ahead of time.

For system prompt (SP) development you:

Write a test set of messages where the model fails, i.e. where the default behavior isn't what you want
Find an SP that causes those tests to pass
Find messages the SP is missaplied to and fix the SP
Expand your test set & repeat

A recently leaked Vercel V0 system prompt contains 1617 lines of code!

If you have a strong test suite you can adopt new models faster, iterate better and build more reliable and useful product features than your competition. However if we want to create goal-oriented agents with autonomy a broad set of tests are going to be essential.

References

Google Research: Can large language models identify and correct their mistakes?

Sharing is Caring

Edit this page