Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o

Inside Deb8flow: Real-time AI debates with LangGraph and GPT-4o

Introduction

I’ve always been fascinated by debates—the strategic framing, the sharp retorts, and the carefully timed comebacks. Debates aren’t just entertaining; they’re structured battles of ideas, driven by logic and evidence. Recently, I started wondering: could we replicate that dynamic using AI agents—having them debate each other autonomously, complete with real-time fact-checking and moderation? The result was Deb8flow, an autonomous AI debating environment powered by LangGraph, OpenAI’s GPT-4o model, and the new integrated Web Search feature.

In Deb8flow, two agents—Pro and Con—square off on a given topic while a Moderator manages turn-taking. A dedicated Fact Checker reviews every claim in real time using GPT-4o’s new browsing capabilities, and a final Judge evaluates the arguments for quality and coherence. If an agent repeatedly makes factual errors, they’re automatically disqualified—ensuring the debate stays grounded in truth.

This article offers an in-depth look at the advanced architecture and dynamic workflows that power autonomous AI debates. I’ll walk you through how Deb8flow’s modular design leverages LangGraph’s state management and conditional routing, alongside GPT-4o’s capabilities.

Even if you’re new to AI agents or LangGraph (see resources [1] and [2] for primers), I’ll explain the key concepts clearly. And if you’d like to explore further, the full project is available on GitHub: iason-solomos/Deb8flow.

Ready to see how AI agents can debate autonomously in practice?

Let’s dive in.

High-Level Overview: Autonomous Debates with Multiple Agents

In Deb8flow, we orchestrate a formal debate between two AI agents – one arguing Pro and one Con – complete with a Moderator, a Fact Checker, and a final Judge. The debate unfolds autonomously, with each agent playing a role in a structured format.

At its core, Deb8flow is a LangGraph-powered agent system, built atop LangChain, using GPT-4o to power each role—Pro, Con, Judge, and beyond. We use GPT-4o’s preview model with browsing capabilities to enable real-time fact-checking. In essence, the Pro and Con agents debate; after each statement, a fact-checker agent uses GPT-4o’s web search to catch any hallucinations or inaccuracies in that statement in real time.​ The debate only continues once the statement is verified. The whole process is coordinated by a LangGraph-defined workflow that ensures proper turn-taking and conditional logic.


High-level debate flow graph. Each rectangle is an agent node (Pro/Con debaters, Fact Checker, Judge, etc.), and diamonds are control nodes (Moderator and a router after fact-checking). Solid arrows denote the normal progression, while dashed arrows indicate retries if a claim fails fact-check. The Judge node outputs the final verdict, then the workflow ends.
Image generated by the author with DALL-E

The debate workflow goes through these stages:

  • Topic Generation: A Topic Generator agent produces a nuanced, debatable topic for the session (e.g. “Should AI be used in classroom education?”).
  • Opening: The Pro Argument Agent makes an opening statement in favor of the topic, kicking off the debate.
  • Rebuttal: The Debate Moderator then gives the floor to the Con Argument agent, who rebuts the Pro’s opening statement.
  • Counter: The Moderator gives the floor back to the Pro agent, who counters the Con agent’s points.
  • Closing: The Moderator switches the floor to the Con agent one last time for a closing argument.
  • Judgment: Finally, the Judge agent reviews the full debate history and evaluates both sides based on argument quality, clarity, and persuasiveness. The most convincing side wins.

After every single speech, the Fact Checker agent steps in to verify the factual accuracy of that statement​. If a debater’s claim doesn’t hold up (e.g. cites a wrong statistic or “hallucinates” a fact), the workflow triggers a retry: the speaker has to correct or modify their statement. (If either debater accumulates 3 fact-check failures, they are automatically disqualified for repeatedly spreading inaccuracies, and their opponent wins by default.) This mechanism keeps our AI debaters honest and grounded in reality!

Prerequisites and Setup

Before diving into the code, make sure you have the following in place:

  • Python 3.12+ installed.
  • An OpenAI API key with access to the GPT-4o model. You can create your own API key here: https://platform.openai.com/settings/organization/api-keys
  • Project Code: Clone the Deb8flow repository from GitHub (git clone https://github.com/iason-solomos/Deb8flow.git). The repo includes a requirements.txt for all required packages. Key dependencies include LangChain/LangGraph (for building the agent graph) and the OpenAI Python client.
  • Install Dependencies: In your project directory, run: pip install -r requirements.txt to install the necessary libraries.
  • Create a .env file in the project root to hold your OpenAI API credentials. It should be of the form: OPENAI_API_KEY_GPT4O = "sk-…"
  • You can also at any time check out the README file: https://github.com/iason-solomos/Deb8flow if you simply want to run the finished app.

Once dependencies are installed and the environment variable is set, you should be ready to run the app. The project structure is organized for clarity:

Deb8flow/
├── configurations/
│ ├── debate_constants.py
│ └── llm_config.py
├── nodes/
│ ├── base_component.py
│ ├── topic_generator_node.py
│ ├── pro_debater_node.py
│ ├── con_debater_node.py
│ ├── debate_moderator_node.py
│ ├── fact_checker_node.py
│ ├── fact_check_router_node.py
│ └── judge_node.py
├── prompts/
│ ├── topic_generator_prompts.py
│ ├── pro_debater_prompts.py
│ ├── con_debater_prompts.py
│ └── … (prompts for other agents)
├── tests/ (contains unit and whole workflow tests)
├── debate_workflow.py
└── debate_state.py

A quick tour of this structure:

configurations/ holds constant definitions and LLM configuration classes.

nodes/ contains the implementation of each agent or functional node in the debate (each of these is a module defining one agent’s behavior).

prompts/ stores the prompt templates for the language model (so each agent knows how to prompt GPT-4o for its specific task).

debate_workflow.py ties everything together by defining the LangGraph workflow (the graph of nodes and transitions).

debate_state.py defines the shared data structure that the agents will be using on each run.

tests/ includes some basic tests and example runs to help you verify everything is working.

Under the Hood: State Management and Workflow Setup

To coordinate a complex multi-turn debate, we need a shared state and a well-defined flow. We’ll start by looking at how Deb8flow defines the debate state and constants, and then see how the LangGraph workflow is constructed.

Defining the Debate State Schema (debate_state.py)

Deb8flow uses a shared state (https://langchain-ai.github.io/langgraph/concepts/low_level/#state) in the form of a Python TypedDict that all agents can read from and update. This state tracks the debate’s progress and context – things like the topic, the history of messages, whose turn it is, etc. By centralizing this information, each agent node can make decisions based on the current state of the debate.

Link: debate_state.py

from typing import TypedDict, List, Dict, Literal


DebateStage = Literal["opening", "rebuttal", "counter", "final_argument"]

class DebateMessage(TypedDict):
    speaker: str  # e.g. pro or con
    content: str  # The message each speaker produced
    validated: bool  # Whether the FactChecker ok’d this message
    stage: DebateStage # The stage of the debate when this message was produced

class DebateState(TypedDict):
    debate_topic: str
    positions: Dict[str, str]
    messages: List[DebateMessage]
    opening_statement_pro_agent: str
    stage: str  # "opening", "rebuttal", "counter", "final_argument"
    speaker: str  # "pro" or "con"
    times_pro_fact_checked: int # The number of times the pro agent has been fact-checked. If it reaches 3, the pro agent is disqualified.
    times_con_fact_checked: int # The number of times the con agent has been fact-checked. If it reaches 3, the con agent is disqualified.

Key fields that we need to have in the DebateState include:

  • debate_topic (str): The topic being debated.
  • messages (List[DebateMessage]): A list of all messages exchanged so far. Each message is a dictionary with fields for speaker (e.g. "pro" or "con" or "fact_checker"), the message content (text), a validated flag (whether it passed fact-check), and the stage of the debate when it was produced.
  • stage (str): The current debate stage (one of "opening", "rebuttal", "counter", "final_argument").
  • speaker (str): Whose turn it is currently ("pro" or "con").
  • times_pro_fact_checked / times_con_fact_checked (int): Counters for how many times each side has been caught with a false claim. (In our rules, if a debater fails fact-check 3 times, they are automatically disqualified and their opponent wins by default.)
  • positions (Dict[str, str]): (Optional) A mapping of each side’s general stance (e.g., "pro": "In favor of the topic").

By structuring the debate’s state this way, each agent can easily access the conversation history or check the current stage, and the control logic can update the state between turns. The state is essentially the memory of the debate.
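
For intuition, here is a purely illustrative snapshot of what the state might look like right after the Pro agent’s opening statement has passed fact-checking (all values are made up for this example):

example_state: DebateState = {
    "debate_topic": "Should AI be used in classroom education?",
    "positions": {"pro": "In favor of the topic", "con": "Against the topic"},
    "messages": [
        {
            "speaker": "pro",
            "content": "AI tutors can personalize learning at scale...",
            "validated": True,
            "stage": "opening",
        }
    ],
    "opening_statement_pro_agent": "AI tutors can personalize learning at scale...",
    "stage": "rebuttal",
    "speaker": "con",
    "times_pro_fact_checked": 0,
    "times_con_fact_checked": 0,
}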

Constants and Configuration

To avoid “magic strings” scattered in the code, we define some constants in debate_constants.py. For example, constants for stage names (STAGE_OPENING = "opening", etc.), speaker identifiers (SPEAKER_PRO = "pro", SPEAKER_CON = "con", etc.), and node names (NODE_PRO_DEBATER = "pro_debater_node", etc.). These make the code easier to maintain and read.

debate_constants.py:

# Stage names
STAGE_OPENING = "opening"
STAGE_REBUTTAL = "rebuttal"
STAGE_COUNTER = "counter"
STAGE_FINAL_ARGUMENT = "final_argument"
STAGE_END = "end"

# Speakers
SPEAKER_PRO = "pro"
SPEAKER_CON = "con"
SPEAKER_JUDGE = "judge"

# Node names
NODE_PRO_DEBATER = "pro_debater_node"
NODE_CON_DEBATER = "con_debater_node"
NODE_DEBATE_MODERATOR = "debate_moderator_node"
NODE_JUDGE = "judge_node"

We also set up LLM configuration in llm_config.py. Here, we define classes for OpenAI or Azure OpenAI configs and then create a dictionary llm_config_map mapping model names to their config. For instance, we map "gpt-4o" to an OpenAILLMConfig that holds the model name and API key. This way, whenever we need to initialize a GPT-4o agent, we can just do llm_config_map["gpt-4o"] to get the right config. All our main agents (debaters, topic generator, judge) use this same GPT-4o configuration.

import os
from dataclasses import dataclass
from typing import Union

@dataclass
class OpenAILLMConfig:
    """
    A data class to store configuration details for OpenAI models.

    Attributes:
        model_name (str): The name of the OpenAI model to use.
        openai_api_key (str): The API key for authenticating with the OpenAI service.
    """
    model_name: str
    openai_api_key: str


llm_config_map = {
    "gpt-4o": OpenAILLMConfig(
        model_name="gpt-4o",
        openai_api_key=os.getenv("OPENAI_API_KEY_GPT4O"),
    )
}

Building the LangGraph Workflow (debate_workflow.py)

With state and configs in place, we construct the debate workflow graph. LangGraph’s StateGraph is the backbone that connects all our agent nodes in the order they should execute. Here’s how we set it up:

class DebateWorkflow:

    def _initialize_workflow(self) -> StateGraph:
        workflow = StateGraph(DebateState)
        # Nodes
        workflow.add_node("generate_topic_node", GenerateTopicNode(llm_config_map["gpt-4o"]))
        workflow.add_node("pro_debater_node", ProDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("con_debater_node", ConDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("fact_check_node", FactCheckNode())
        workflow.add_node("fact_check_router_node", FactCheckRouterNode())
        workflow.add_node("debate_moderator_node", DebateModeratorNode())
        workflow.add_node("judge_node", JudgeNode(llm_config_map["gpt-4o"]))

        # Entry point
        workflow.set_entry_point("generate_topic_node")

        # Flow
        workflow.add_edge("generate_topic_node", "pro_debater_node")
        workflow.add_edge("pro_debater_node", "fact_check_node")
        workflow.add_edge("con_debater_node", "fact_check_node")
        workflow.add_edge("fact_check_node", "fact_check_router_node")
        workflow.add_edge("judge_node", END)
        return workflow



    async def run(self):
        workflow = self._initialize_workflow()
        graph = workflow.compile()
        # graph.get_graph().draw_mermaid_png(output_file_path="workflow_graph.png")
        initial_state = {
            "topic": "",
            "positions": {}
        }
        final_state = await graph.ainvoke(initial_state, config={"recursion_limit": 50})
        return final_state

Let’s break down what’s happening:

  • We initialize a new StateGraph with our DebateState type as the state schema.
  • We add each node (agent) to the graph with a name. For nodes that need an LLM, we pass in the GPT-4o config. For example, "pro_debater_node" is added as ProDebaterNode(llm_config_map["gpt-4o"]), meaning the Pro debater agent will use GPT-4o as its underlying model.
  • We set the entry point of the graph to "generate_topic_node". This means the first step of the workflow is to generate a debate topic.
  • Then we add directed edges to connect nodes. The edges above encode the primary sequence: topic -> pro’s turn -> fact-check -> (then a routing decision) -> … eventually -> judge -> END. We don’t connect the Moderator or Fact Check Router with static edges, since these nodes use dynamic commands to redirect the flow. The final edge connects the judge to an END marker to terminate the graph.

When the workflow runs, control will pass along these edges in order, but whenever we hit a router or moderator node, that node will output a command telling the graph which node to go to next (overriding the default edge). This is how we create conditional loops: the fact_check_router_node might send us back to a debater node for a retry, instead of following a straight line. LangGraph supports this by allowing nodes to return a special Command object with goto instructions.
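
To make that concrete, here is a minimal sketch of the pattern a control node follows (a simplified illustration, not code copied from the repo): it returns a Command that can both update the state and name the next node to run.

from langgraph.types import Command

def example_router(state: DebateState) -> Command:
    # If the last statement failed fact-checking, send the Pro debater back to retry;
    # otherwise hand control to the moderator.
    last_message = state["messages"][-1]
    if not last_message["validated"]:
        return Command(goto="pro_debater_node")
    return Command(update={"stage": "rebuttal"}, goto="debate_moderator_node")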

In summary, at a high level we’ve defined an agentic workflow: a graph of autonomous agents where control can branch and loop based on the agents’ outputs. Now, let’s explore what each of these agent nodes actually does.

Agent Nodes Breakdown

Each stage or role in the debate is encapsulated in a node (agent). In LangGraph, nodes are often simple functions, but I wanted a more object-oriented approach for clarity and reusability. So in Deb8flow, every node is a class with a __call__ method. All the main agent classes inherit from a common BaseComponent for shared functionality. This design makes the system modular: we can easily swap out or extend agents by modifying their class definitions, and each agent class is responsible for its piece of the workflow.

Let’s go through the key agents one by one.

BaseComponent – A Reusable Agent Base Class

Most of our agent nodes (like the debaters and judge) share common needs: they use an LLM to generate output, they might need to retry on errors, and they should track token usage. The BaseComponent class (defined in nodes/base_component.py: https://github.com/iason-solomos/Deb8flow/blob/main/nodes/base_component.py) provides these common features so we don’t repeat code.

class BaseComponent:
    """
    A foundational class for managing LLM-based workflows with token tracking.
    Can handle both Azure OpenAI (AzureChatOpenAI) and OpenAI (ChatOpenAI).
    """

    def __init__(
        self,
        llm_config: Optional[LLMConfig] = None,
        temperature: float = 0.0,
        max_retries: int = 5,
    ):
        """
        Initializes the BaseComponent with optional LLM configuration and temperature.

        Args:
            llm_config (Optional[LLMConfig]): Configuration for either Azure or OpenAI.
            temperature (float): Controls the randomness of LLM outputs. Defaults to 0.0.
            max_retries (int): How many times to retry on 429 errors.
        """
        logger = logging.getLogger(self.__class__.__name__)
        tracer = trace.get_tracer(__name__, tracer_provider=get_tracer_provider())

        self.logger = logger
        self.tracer = tracer
        self.llm: Optional[ChatOpenAI] = None
        self.output_parser: Optional[StrOutputParser] = None
        self.state: Optional[DebateState] = None
        self.prompt_template: Optional[ChatPromptTemplate] = None
        self.chain: Optional[RunnableSequence] = None
        self.documents: Optional[List] = None
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.max_retries = max_retries

        if llm_config is not None:
            self.llm = self._init_llm(llm_config, temperature)
            self.output_parser = StrOutputParser()

    def _init_llm(self, config: LLMConfig, temperature: float):
        """
        Initializes an LLM instance for either Azure OpenAI or standard OpenAI.
        """
        if isinstance(config, AzureOpenAILLMConfig):
            # If it's Azure, use the AzureChatOpenAI class
            return AzureChatOpenAI(
                deployment_name=config.deployment_name,
                azure_endpoint=config.azure_endpoint,
                openai_api_version=config.openai_api_version,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        elif isinstance(config, OpenAILLMConfig):
            # If it's standard OpenAI, use the ChatOpenAI class
            return ChatOpenAI(
                model_name=config.model_name,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        else:
            raise ValueError("Unsupported LLMConfig type.")

    def validate_initialization(self) -> None:
        """
        Ensures we have an LLM and an output parser.
        """
        if not self.llm:
            raise ValueError("LLM is not initialized. Ensure `llm_config` is provided.")
        if not self.output_parser:
            raise ValueError("Output parser is not initialized.")

    def execute_chain(self, inputs: Any) -> Any:
        """
        Executes the LLM chain, tracks token usage, and retries on 429 errors.
        """
        if not self.chain:
            raise ValueError("No chain is initialized for execution.")

        retry_wait = 1  # Initial wait time in seconds

        for attempt in range(self.max_retries):
            try:
                with get_openai_callback() as cb:
                    result = self.chain.invoke(inputs)
                    self.logger.info("Prompt Token usage: %s", cb.prompt_tokens)
                    self.logger.info("Completion Token usage: %s", cb.completion_tokens)
                    self.prompt_tokens = cb.prompt_tokens
                    self.completion_tokens = cb.completion_tokens

                return result

            except Exception as e:
                # If the error mentions 429, do exponential backoff and retry
                if "429" in str(e):
                    self.logger.warning(
                        f"Rate limit reached. Retrying in {retry_wait} seconds... "
                        f"(Attempt {attempt + 1}/{self.max_retries})"
                    )
                    time.sleep(retry_wait)
                    retry_wait *= 2
                else:
                    self.logger.error(f"Unexpected error: {str(e)}")
                    raise e

        raise Exception("API request failed after maximum number of retries")

    def create_chain(
        self, system_template: str, human_template: str
    ) -> RunnableSequence:
        """
        Creates a chain for unstructured outputs.
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm | self.output_parser
        return self.chain

    def create_structured_output_chain(
        self, system_template: str, human_template: str, output_model: Type[BaseModel]
    ) -> RunnableSequence:
        """
        Creates a chain that yields structured outputs (parsed into a Pydantic model).
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm.with_structured_output(output_model)
        return self.chain

    def build_return_with_tokens(self, node_specific_data: dict) -> dict:
        """
        Convenience method to add token usage info into the return values.
        """
        return {
            **node_specific_data,
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
        }

    def __call__(self, state: DebateState) -> None:
        """
        Updates the node's local copy of the state.
        """
        self.state = state
        for key, value in state.items():
            setattr(self, key, value)

Key features of BaseComponent:

  • It stores an LLM client (e.g. an OpenAI ChatOpenAI instance) initialized with a given model and API key, as well as an output parser.
  • It provides a method create_chain(system_template, human_template) which sets up a LangChain prompt chain (a RunnableSequence) combining a system prompt and a human prompt. This chain is what actually generates outputs when run.
  • It has an execute_chain(inputs) method that invokes the chain and includes logic to retry if the OpenAI API returns a rate-limit error (HTTP 429). This is done with exponential backoff up to a max_retries count.
  • It keeps track of token usage (prompt tokens and completion tokens) for logging or analysis.
  • The __call__ method of BaseComponent (which each subclass will call via super().__call__(state)) can perform any setup needed before the node’s main logic runs (like ensuring the LLM is initialized).

By building on BaseComponent, each agent class can focus on its unique logic (like what prompt to use and how to handle the state), while inheriting the heavy lifting of interacting with GPT-4o reliably.

Topic Generator Agent (GenerateTopicNode)

The Topic Generator (topic_generator_node.py) is the first agent in the graph. Its job is to come up with a debatable topic for the session. We give it a prompt that instructs it to output a nuanced topic that could reasonably have a pro and con side.

This agent inherits from BaseComponent and uses a prompt chain (system + human prompt) to generate one item of text – the debate topic. When called, it executes the chain (with no special input, just using the prompt) and gets back a topic_text. It then updates the state with:

  • debate_topic: the generated topic (stripped of any extra whitespace),
  • positions: a dictionary assigning the pro and con stances (by default we use "In favor of the topic" and "Against the topic"),
  • stage: set to "opening",
  • speaker: set to "pro" (so the Pro side will speak first).

In code, the return might look like:

return {
    "debate_topic": debate_topic,
    "positions": positions,
    "stage": "opening",
    "speaker": first_speaker  # "pro"
}

Here are the prompts for the topic generator:

SYSTEM_PROMPT = """\
You are a brainstorming AI that suggests debate topics.
You will provide a single, interesting or timely topic that can have two opposing views.
"""

HUMAN_PROMPT = """\
Please suggest one debate topic for two AI agents to discuss.
For example, it could be about technology, politics, philosophy, or any interesting domain.
Just provide the topic in a concise sentence.
"""

Then we pass these prompts in the constructor of the class itself.

class GenerateTopicNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        # Create the prompt chain.
        self.chain: RunnableSequence = self.create_chain(
            system_template=SYSTEM_PROMPT,
            human_template=HUMAN_PROMPT
        )

    def __call__(self, state: DebateState) -> Dict[str, str]:
        """
        Generates a debate topic and assigns positions to the two debaters.
        """
        super().__call__(state)

        topic_text = self.execute_chain({})

        # Store the topic and assign stances in the DebateState
        debate_topic = topic_text.strip()
        positions = {
            "pro": "In favor of the topic",
            "con": "Against the topic"
        }

        
        first_speaker = "pro"
        self.logger.info("Welcome to our debate panel! Today's debate topic is: %s", debate_topic)
        return {
            "debate_topic": debate_topic,
            "positions": positions,
            "stage": "opening",
            "speaker": first_speaker
        }

This is a pattern we will repeat for all the LLM-based agent classes; the control nodes that don’t use an LLM and the fact checker work differently.

Now we can implement the 2 stars of the show, the Pro and Con argument agents!

Debater Agents (Pro and Con)

Link: pro_debater_node.py

The two debater agents are very similar in structure, but each uses different prompt templates tailored to their role (pro vs con) and the stage of the debate.

The Pro debater, for example, has to handle an opening statement and a counter-argument (countering the Con’s rebuttal). We also need logic for retries in case a statement fails fact-check. In code, the ProDebater class sets up multiple prompt chains:

  • opening_chain and an opening_retry_chain (using slightly different human prompts – the retry prompt might instruct it to try again without repeating any factually dubious claims).
  • counter_chain and counter_retry_chain for the counter-argument stage.

class ProDebaterNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        self.opening_chain = self.create_chain(SYSTEM_PROMPT, OPENING_HUMAN_PROMPT)
        self.opening_retry_chain = self.create_chain(SYSTEM_PROMPT, OPENING_RETRY_HUMAN_PROMPT)
        self.counter_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_HUMAN_PROMPT)
        self.counter_retry_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_RETRY_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)

        debate_topic = state.get("debate_topic")
        messages = state.get("messages", [])
        stage = state.get("stage")
        speaker = state.get("speaker")

        # Check if retrying (last message was by pro and not validated)
        last_msg = messages[-1] if messages else None
        retrying = last_msg and last_msg["speaker"] == SPEAKER_PRO and not last_msg["validated"]

        if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
            chain = self.opening_retry_chain if retrying else self.opening_chain  # select which chain to use: the normal one or the retry one after a failed fact check
            result = chain.invoke({
                "debate_topic": debate_topic
            })
        elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
            opponent_msg = self._get_last_message_by(SPEAKER_CON, messages)
            debate_history = get_debate_history(messages)
            chain = self.counter_retry_chain if retrying else self.counter_chain
            result = chain.invoke({
                "debate_topic": debate_topic,
                "opponent_statement": opponent_msg,
                "debate_history": debate_history
            })
        else:
            raise ValueError(f"Unknown turn for ProDebater: stage={stage}, speaker={speaker}")
        new_message = create_debate_message(speaker=SPEAKER_PRO, content=result, stage=stage)
        self.logger.info("Speaker: %s, Stage: %s, Retry: %s\nMessage:\n%s", speaker, stage, retrying, result)
        return {
            "messages": messages + [new_message]
        }

    def _get_last_message_by(self, speaker_prefix, messages):
        for m in reversed(messages):
            if m.get("speaker") == speaker_prefix:
                return m["content"]
        return ""

When the ProDebater’s __call__ runs, it looks at the current stage and speaker in the state to decide what to do:

  • If it’s the opening stage and the speaker is “pro”, it uses the opening_chain to generate an opening argument. If the last message from Pro was marked invalid (not validated), it knows this is a retry, so it would use the opening_retry_chain instead.
  • If it’s the counter stage and speaker is “pro”, it generates a counter-argument to whatever the opponent (Con) just said. It will fetch the last message by the Con from the messages history, and feed that into the prompt (so that the Pro can directly counter it). Again, if the last Pro message was invalid, it would switch to the retry chain.

After generating its argument, the Debater agent creates a new message entry (with speaker="pro", the content text, validated=False initially, and the stage) and appends it to the state’s message list. That becomes the output of the node (LangGraph will merge this partial state update into the global state).
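
The snippet above also relies on two small helpers, create_debate_message and get_debate_history, which aren’t shown here. Below is a minimal sketch of what they might look like, reconstructed from how they are called (the repository’s actual implementation may differ), using the types from debate_state.py:

def create_debate_message(speaker: str, content: str, stage: str) -> DebateMessage:
    # New messages start unvalidated; the Fact Checker flips this flag later.
    return {
        "speaker": speaker,
        "content": content,
        "validated": False,
        "stage": stage,
    }


def get_debate_history(messages: List[DebateMessage]) -> str:
    # Flatten the conversation so far into a transcript string for the prompts.
    return "\n\n".join(f"{m['speaker'].upper()}: {m['content']}" for m in messages)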

The Con Debater agent mirrors this logic for its stages:

  • It handles the rebuttal and the closing (final argument) stages, each with a normal and a retry chain.
  • It checks whether it’s the rebuttal or final argument stage (speaker “con”) and invokes the appropriate chain, using the last Pro message for context when rebutting.
  • It then appends its message to the state, just like the Pro debater does.

Link: con_debater_node.py

By using class-based implementation, our debaters’ code is easier to maintain. We can clearly separate what the Pro does vs what the Con does, even if they share structure. Also, by encapsulating prompt chains inside the class, each debater can manage multiple possible outputs (regular vs retry) cleanly.

Prompt design: The actual prompts (in prompts/pro_debater_prompts.py and con_debater_prompts.py) guide the GPT-4o model to take on a persona (“You are a debater arguing for/against the topic…”) and produce the argument. They also instruct the model to keep statements factual and logical. If a fact check fails, the retry prompt may say something like: “Your previous statement had an unverified claim. Revise your argument to be factually correct while maintaining your position.” – encouraging the model to correct itself.
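
For illustration only, the opening retry prompt could look something like the following. This is my paraphrase of the idea, not the exact template from prompts/pro_debater_prompts.py:

OPENING_RETRY_HUMAN_PROMPT = """\
The debate topic is: {debate_topic}

Your previous opening statement contained a claim that failed fact-checking.
Write a new opening statement in favor of the topic. Keep your position and
overall structure, but remove or correct any unverified statistics or studies,
and only make claims you are confident are accurate.
"""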

With this, our AI debaters can engage in a multi-turn duel, and even recover from factual missteps.

Fact Checker Agent (FactCheckNode)

After each debater speaks, the Fact Checker agent swoops in to verify their claims. This agent is implemented in fact_checker_node.py (https://github.com/iason-solomos/Deb8flow/blob/main/nodes/fact_checker_node.py), and interestingly, it uses GPT-4o’s built-in web search rather than our own custom prompts. Essentially, we delegate the fact-checking to OpenAI’s GPT-4o search-preview model.

How does this work? The OpenAI Python client (using the gpt-4o-search-preview model with built-in web search) allows us to send a user message and get a structured response back. In FactCheckNode.__call__, we do something like:

completion = self.client.beta.chat.completions.parse(
    model="gpt-4o-search-preview",
    web_search_options={},
    messages=[{
        "role": "user",
        "content": (
            f"Consider the following statement from a debate. "
            f"If the statement contains numbers, or figures from studies, fact-check it online.\n\n"
            f"Statement:\n\"{claim}\"\n\n"
            f"Reply clearly whether any numbers or studies might be inaccurate or hallucinated, and why."
            f"\n"
            f"If the statement doesn't contain references to studies or numbers cited, don't go online to fact-check, and just consider it successfully fact-checked, with a 'yes' score.\n\n"
        )
    }],
    response_format=FactCheck
)

If the result is “yes” (meaning the claim seems truthful or at least not factually wrong), the Fact Checker will mark the last message’s validated field as True in the state, and output {"validated": True} with no further changes. This signals that the debate can continue normally.

If the result is “no” (meaning it found the claim to be incorrect or dubious), the Fact Checker will append a new message to the state with speaker="fact_checker" describing the finding (or we could simply mark it, but providing a brief note like “(Fact Checker: The statistic cited could not be verified.)” can be useful). It will also set validated: False and increment a counter for whichever side made the claim. The output state from this node includes validated: False and an updated times_pro_fact_checked or times_con_fact_checked count.
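
In code, the update returned by the node is roughly along these lines. This is a simplified sketch of the logic described above (not a verbatim copy of fact_checker_node.py); messages, speaker, and state come from the node’s __call__, and binary_score comes from the FactCheck model shown next:

result = completion.choices[0].message.parsed  # a FactCheck instance

if result.binary_score == "yes":
    # Claim holds up: mark the last message as validated and continue.
    messages[-1]["validated"] = True
    return {"messages": messages, "validated": True}
else:
    # Claim failed: leave it unvalidated and increment the offender's counter.
    # (The real node may also append a short fact_checker note to messages.)
    messages[-1]["validated"] = False
    counter_key = (
        "times_pro_fact_checked" if speaker == SPEAKER_PRO else "times_con_fact_checked"
    )
    return {
        "messages": messages,
        "validated": False,
        counter_key: state[counter_key] + 1,
    }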

We also use a Pydantic BaseModel to control the output of the LLM:

class FactCheck(BaseModel):
    """
    Pydantic model for the fact checking the claims made by debaters.

    Attributes:
        binary_score (str): 'yes' if the claim is verifiable and truthful, 'no' otherwise.
    """

    binary_score: str = Field(
        description="Indicates if the claim is verifiable and truthful. 'yes' or 'no'."
    )
    justification: str = Field(
        description="Explanation of the reasoning behind the score."
    )

Debate Moderator Agent (DebateModeratorNode)

The Debate Moderator is the conductor of the debate. Instead of producing lengthy text, this agent’s job is to manage turn-taking and stage progression. In the workflow, after a statement is validated by the Fact Checker, control passes to the Moderator node. The Moderator then issues a Command that updates the state for the next turn and directs the flow to the appropriate next agent.

The logic in DebateModeratorNode.__call__ (see nodes/debate_moderator_node.py: https://github.com/iason-solomos/Deb8flow/blob/main/nodes/debate_moderator_node.py) goes roughly like this:

if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_REBUTTAL, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_REBUTTAL and speaker == SPEAKER_CON:
    return Command(
        update={"stage": STAGE_COUNTER, "speaker": SPEAKER_PRO},
        goto=NODE_PRO_DEBATER
    )
elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_FINAL_ARGUMENT, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_FINAL_ARGUMENT and speaker == SPEAKER_CON:
    return Command(
        update={},
        goto=NODE_JUDGE
    )

raise ValueError(f"Unexpected stage/speaker combo: stage={stage}, speaker={speaker}")

Each conditional corresponds to a point in the debate where a turn just ended, and sets up the next turn. For example, after the opening (Pro just spoke), it sets stage to rebuttal, switches speaker to Con, and directs the workflow to the Con debater node​. After the final_argument (Con’s closing), it directs to the Judge with no further update (the debate stage effectively ends).

Fact Check Router (FactCheckRouterNode)

This is another control node (like the Moderator) that introduces conditional logic. The Fact Check Router sits right after the Fact Checker agent in the flow. Its purpose is to branch the workflow depending on the fact-check result.

In nodes/fact_check_router_node.py (https://github.com/iason-solomos/Deb8flow/blob/main/nodes/fact_check_router_node.py), the logic is:

if pro_fact_checks >= 3 or con_fact_checks >= 3:
    disqualified = SPEAKER_PRO if pro_fact_checks >= 3 else SPEAKER_CON
    winner = SPEAKER_CON if disqualified == SPEAKER_PRO else SPEAKER_PRO

    verdict_msg = {
        "speaker": "moderator",
        "content": (
            f"Debate ended early due to excessive factual inaccuracies.\n\n"
            f"DISQUALIFIED: {disqualified.upper()} (exceeded fact check limit)\n"
            f"WINNER: {winner.upper()}"
        ),
        "validated": True,
        "stage": "verdict"
    }
    return Command(
        update={"messages": messages + [verdict_msg]},
        goto=END
    )
if last_message.get("validated"):
    return Command(goto=NODE_DEBATE_MODERATOR)
elif speaker == SPEAKER_PRO:
    return Command(goto=NODE_PRO_DEBATER)
elif speaker == SPEAKER_CON:
    return Command(goto=NODE_CON_DEBATER)
raise ValueError("Unable to determine routing in FactCheckRouterNode.")

First, the Fact Check Router checks if either side’s fact-check count has reached 3. If so, it creates a Moderator-style message announcing an early end: the offending side is disqualified and the other side is the winner​. It appends this verdict to the messages and returns a Command that jumps to END, effectively terminating the debate without going to the Judge (because we already know the outcome).

If we’re not ending the debate early, it then looks at the Fact Checker’s result for the last message (which is stored as validated on that message). If validated is True, we go to the debate moderator: Command(goto=NODE_DEBATE_MODERATOR).

Else if the statement fails fact-check, the workflow goes back to the debater to produce a revised statement (with the state counters updated to reflect the failure). This loop can happen multiple times if needed (up to the disqualification limit).

This dynamic control is the heart of Deb8flow’s “agentic” nature – the ability to adapt the path of execution based on the content of the agents’ outputs. It showcases LangGraph’s strength: combining control flow with state. We’re essentially encoding debate rules (like allowing retries for false claims, or ending the debate if someone cheats too often) directly into the workflow graph.

Judge Agent (JudgeNode)

Last but not least, the Judge agent delivers the final verdict based on rhetorical skill, clarity, structure, and overall persuasiveness. Its system prompt and human prompt make this explicit:

  • System Prompt: “You are an impartial debate judge AI. … Evaluate which debater presented their case more clearly, persuasively, and logically. You must focus on communication skills, structure of argument, rhetorical strength, and overall coherence.”
  • Human Prompt: “Here is the full debate transcript. Please analyze the performance of both debaters—PRO and CON. Evaluate rhetorical performance—clarity, structure, persuasion, and relevance—and decide who presented their case more effectively.”

When the Judge node runs, it receives the entire debate transcript (all validated messages) alongside the original topic. It then uses GPT-4o to examine how each side framed their arguments, handled counterpoints, and supported (or failed to support) claims with examples or logic. Crucially, the Judge is forbidden to evaluate which position is objectively correct (or who it thinks might be correct)—only who argued more persuasively.
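
The Judge’s node follows the same class pattern as the other LLM agents. Here is a condensed, illustrative sketch of what it might look like; the prompt names and the transcript helper are assumptions, not the exact code in nodes/judge_node.py:

class JudgeNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.0):
        super().__init__(llm_config, temperature)
        self.chain = self.create_chain(JUDGE_SYSTEM_PROMPT, JUDGE_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)
        # Assemble the full transcript of validated statements for the prompt.
        transcript = get_debate_history(state.get("messages", []))
        verdict = self.chain.invoke({
            "debate_topic": state.get("debate_topic"),
            "debate_history": transcript,
        })
        self.logger.info("Judge verdict:\n%s", verdict)
        verdict_msg = create_debate_message(
            speaker=SPEAKER_JUDGE, content=verdict, stage="verdict"
        )
        return {"messages": state.get("messages", []) + [verdict_msg]}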

Below is an example final verdict from a Deb8flow run on the topic:
“Should governments implement a universal basic income in response to increasing automation in the workforce?”

WINNER: PRO

REASON: The PRO debater presented a more compelling and rhetorically effective case for universal basic income. Their arguments were well-structured, beginning with a clear statement of the issue and the necessity of UBI in response to automation. They effectively addressed potential counterarguments by highlighting the unprecedented speed and scope of current technological changes, which distinguishes the current situation from past technological shifts. The PRO also provided empirical evidence from UBI pilot programs to counter the CON's claims about work disincentives and economic inefficiencies, reinforcing their argument with real-world examples.

In contrast, the CON debater, while presenting valid concerns about UBI, relied heavily on historical analogies and assumptions about workforce adaptability without adequately addressing the unique challenges posed by modern automation. Their arguments about the fiscal burden and potential inefficiencies of UBI were less supported by specific evidence compared to the PRO's rebuttals.

Overall, the PRO's arguments were more coherent, persuasive, and backed by empirical evidence, making their case more convincing to a neutral observer.

LangSmith Tracing

Throughout Deb8flow’s development, I relied on LangSmith (LangChain’s tracing and observability toolkit) to ensure the entire debate pipeline was behaving correctly. Because we have multiple agents passing control between themselves, it’s easy for unexpected loops or misrouted states to occur. LangSmith provides a convenient way to:

  • Visualize Execution Flow: You can see each agent’s prompt, the tokens consumed (so you can also track costs), and any intermediate states. This makes it much simpler to confirm that, say, the Con Debater is properly referencing the Pro Debater’s last message, or that the Fact Checker is accurately receiving the claim to verify.
  • Debug State Updates: If the Moderator or Fact Check Router is sending the flow to the wrong node, the trace will highlight that mismatch. You can trace which agent was invoked at each step and why, helping you spot stage or speaker misalignments early.
  • Track Prompt and Completion Tokens: With multiple GPT-4o calls, it’s useful to see how many tokens each stage is using, which LangSmith logs automatically if you enable tracing.

Integrating LangSmith is surprisingly easy. You just need to provide three keys in your .env file: LANGCHAIN_API_KEY, LANGCHAIN_TRACING_V2, and LANGCHAIN_PROJECT.
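
For reference, the entries might look like this (the project name here is just an example):

LANGCHAIN_TRACING_V2 = "true"
LANGCHAIN_API_KEY = "…"
LANGCHAIN_PROJECT = "deb8flow"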

Then you can open the LangSmith UI to see a structured trace of each run. This greatly reduces the guesswork involved in debugging multi-agent systems and is, in my experience, essential for more complex AI orchestration like ours. Example of a single run:

The trace of one run in waterfall mode in LangSmith, showing how the whole flow ran. Source: Generated by the author using LangSmith.

Reflections and Next Steps

Building Deb8flow was an eye-opening exercise in orchestrating autonomous agent workflows. We didn’t just chain a single model call – we created an entire debate simulation with AI agents, each with a specific role, and allowed them to interact according to a set of rules. LangGraph provided a clear framework to define how data and control flows between agents, making the complex sequence manageable in code. By using class-based agents and a shared state, we maintained modularity and clarity, which will pay off for any software engineering project in the long run.

An exciting aspect of this project was seeing emergent behavior. Even though each agent follows a script (a prompt), the unscripted combination – a debater slipping in a dubious claim, the fact-checker catching it, the debater rephrasing – felt surprisingly realistic! It’s a small step toward more agentic AI systems that can perform non-trivial multi-step tasks while keeping oversight on each other.

There are plenty of ideas for improvement:

  • User Interaction: Currently it’s fully autonomous, but one could add a mode where a human provides the topic or even takes the role of one side against an AI opponent.
  • We can switch the order in which the Debaters talk.
  • We can experiment with different prompts, which to a large degree shape the agents’ behavior.
  • Make the debaters also perform web search before producing their statements, thus providing them with the latest information.

The broader implication of Deb8flow is how it showcases a pattern for composable AI agents. By defining clear boundaries and interactions (just like microservices in software), we can have complex AI-driven processes that remain interpretable and controllable. Each agent is like a cog in a machine, and LangGraph is the gear system making them work in unison.

I found this project energizing, and I hope it inspires you to explore multi-agent workflows. Whether it’s debating, collaborating on writing, or solving problems from different expert angles, the combination of GPT, tools, and structured agentic workflows opens up a new world of possibilities for AI development. Happy hacking!

References

[1] D. Bouchard, “From Basics to Advanced: Exploring LangGraph,” Medium, Nov. 22, 2023. [Online]. Available: https://medium.com/data-science/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787. [Accessed: Apr. 1, 2025].

[2] A. W. T. Ng, “Building a Research Agent that Can Write to Google Docs: Part 1,” Towards Data Science, Jan. 11, 2024. [Online]. Available: https://towardsdatascience.com/building-a-research-agent-that-can-write-to-google-docs-part-1-4b49ea05a292/. [Accessed: Apr. 1, 2025].

Creating an AI Agent to Write Blog Posts with CrewAI

How to assemble a crew of AI agents with CrewAI and Python

Introduction

I love writing. You may notice that if you follow me or my blog. For that reason, I am constantly producing new content and talking about Data Science and Artificial Intelligence.

I discovered this passion a couple of years ago when I was just starting my path in Data Science, learning and evolving my skills. At that time, I heard some more experienced professionals in the area saying that a good study technique was practicing new skills and writing about it somewhere, teaching whatever you learned.

In addition, I had just moved to the US, and nobody knew me here. So I had to start somewhere, creating my professional image in this competitive market. I remember I talked to my cousin, who’s also in the Tech industry, and he told me: write blog posts about your experiences. Tell people what you are doing. And so I did. 

And I never stopped.

Fast forward to 2025: I now have almost two hundred published articles, many of them with Towards Data Science, a published book, and a good audience.

Writing helped me so much in the Data Science area.

Most recently, one of my interests has been Natural Language Processing and Large Language Models. Learning about how these modern models work is fascinating.

That interest led me to experiment with agentic AI as well. So I learned about CrewAI, an open-source package that makes building AI agents fun and easy, with very little code. I decided to test it by creating a crew of agents to write a blog post and see how that goes.

In this post, we will learn how to create those agents and make them work together to produce a simple blog post.

Let’s do that.

What is a Crew?

A crew of AI agents is a combination of two or more agents, each of them performing a task towards a final goal.

In this case study, we will create a crew that will work together to produce a small blog post about a given topic that we will provide.

Crew of Agents workflow. Image by the author

The flow works like this:

  1. We choose a given topic for the agents to write about.
  2. Once the crew is started, it will go to the knowledge base, read some of my previously written articles, and try to mimic my writing style. Then, it generates a set of guidelines and passes it to the next agent.
  3. Next, the Planner agent takes over and searches the Internet looking for good content about the topic. It creates a plan of content and sends it to the next agent.
  4. The Writer agent receives the writing plan and executes it according to the context and information received.
  5. Finally, the content is passed to the last agent, the Editor, who reviews the content and returns the final document as the output.

In the following section, we will see how this can be created.

Code

CrewAI is a great Python package because it simplifies the code for us. So, let’s begin by installing the two needed packages.

pip install crewai crewai-tools

Next, if you want, you can follow the instructions on their Quickstart page and have a full project structure created for you with just a couple of commands on a terminal. Basically, it will install some dependencies, generate the folder structure suggested for CrewAI projects, as well as generate some .yaml and .py files. 

I personally prefer to create those myself, but it is up to you. The page is listed in the References section.

Folder Structure

So, here we go.

We will create these folders:

  • knowledge
  • config

And these files:

  • In the config folder: create the files agents.yaml and tasks.yaml
  • In the knowledge folder: this is where I will add the files with my writing style.
  • In the project root: create crew.py and main.py.
Folder structure. Image by the author.

Make sure to create the folders with the names mentioned, as CrewAI looks for agents and tasks inside the config folder and for the knowledge base within a knowledge folder.

Next, let us set our agents. 

Agents

The agents are composed of:

  • Name of the agent: writer_style
  • Role: LLMs are good role players, so here you can tell them which role to play.
  • Goal: tell the model what the goal of that agent is.
  • Backstory: Describe the story behind this agent, who it is, what it does. 
writer_style:
  role: >
    Writing Style Analyst
  goal: >
    Thoroughly read the knowledge base and learn the characteristics of the crew, 
    such as tone, style, vocabulary, mood, and grammar.
  backstory: >
    You are an experienced ghost writer who can mimic any writing style.
    You know how to identify the tone and style of the original writer and mimic 
    their writing style.
    Your work is the basis for the Content Writer to write an article on this topic.

I won’t bore you with all the agents created for this crew; I believe you get the idea. It is a set of prompts explaining to each agent what they are going to do. All the agents’ instructions are stored in the agents.yaml file.

Think of it as if you were a manager hiring people to create a team. Think about what kinds of professionals you would need, and what skills are needed.

We need 4 professionals who will work towards the final goal of producing written content: (1) a Writer Stylist, (2) a Planner, (3) a Writer, and (4) an Editor.

If you want to see the setup for them, just check the full code in the GitHub repository.

Tasks

Now, back to the analogy of the manager hiring people: once we “hired” our entire crew, it is time to divide up the tasks. We know that we want to produce a blog post and we have 4 agents, but what will each of them do?

Well, that will be configured in the file tasks.yaml.

To illustrate, let me show you the code for the Writer agent. Once again, these are the parts needed for the prompt:

  • Name of the task: write
  • Description: The description is like telling the professional how you want that task to be performed, just like we would tell a new hire how to perform their new job. Give precise instructions to get the best result possible.
  • Expected output: This is how we want to see the output. Notice that I give instructions like the size of the blog post, the quantity of paragraphs, and other information that helps my agent to give me the expected output. 
  • Agent to perform it: Here, we are indicating the agent who will perform this task, using the same name set in the agents.yaml file.
  • Output file: Not always applicable, but when it is, this is the argument to use. We asked for a markdown file as output.
write:
  description: >
    1. Use the content plan to craft a compelling blog post on {topic}.
    2. Incorporate SEO keywords naturally.
    3. Sections/Subtitles are properly named in an engaging manner. Make sure 
    to add Introduction, Problem Statement, Code, Before You Go, References.
    4. Add a summarizing conclusion - This is the "Before You Go" section.
    5. Proofread for grammatical errors and alignment with the writer's style.
    6. Use analogies to make the article more engaging and complex concepts easier
    to understand.
  expected_output: >
    A well-written blog post in markdown format, ready for publication.
    The article must be within a 7 to 12 minutes read.
    Each section must have at least 3 paragraphs.
    When writing code, you will write a snippet of code and explain what it does. 
    Be careful to not add a huge snippet at a time. Break it in reasonable chunks.
    In the examples, create a sample dataset for the code.
    In the Before You Go section, you will write a conclusion that is engaging
    and factually accurate.
  agent: content_writer
  output_file: blog_post.md

After the agents and tasks are defined, it is time to create our crew flow.

Coding the Crew

Now we will create the file crew.py, where we will translate the previously presented flow to Python code.

We begin by importing the needed modules.

#Imports
import os
from crewai import Agent, Task, Process, Crew, LLM
from crewai.project import CrewBase, agent, crew, task
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai_tools import SerperDevTool

We will use the basic Agent, Task, Process, Crew and LLM classes to create our flow. PDFKnowledgeSource will help the first agent learn my writing style, and SerperDevTool is the tool to search the internet. For that one, make sure to get your API key at https://serper.dev/signup.

A best practice in software development is to keep your API keys and configuration settings separate from your code. We’ll use a .env file for this, providing a secure place to store these values. Here’s the command to load them into our environment.

from dotenv import load_dotenv
load_dotenv()
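
For example, the .env file for this project might contain entries like the ones below; the variable names match what the code reads later on (SERPER_API_KEY is the variable SerperDevTool conventionally looks for), and the values are placeholders:

OPENAI_API_KEY = "sk-…"
GROQ_API_KEY = "…"
SERPER_API_KEY = "…"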

Then, we will use the PDFKnowledgeSource to show the Crew where to search for the writer’s style. By default, that tool looks at the knowledge folder of your project, thus the importance of the name being the same.

# Knowledge sources

pdfs = PDFKnowledgeSource(
    file_paths=['article1.pdf',
                'article2.pdf',
                'article3.pdf'
                ]
)

Now we can set up the LLM we want to use for the crew. It can be almost any model; I tested a bunch of them, and the ones I liked the most were qwen-qwq-32b and gpt-4o. If you choose OpenAI’s model, you will need an OpenAI API key. For Qwen-QWQ, just uncomment those lines and comment out the OpenAI ones; you will need an API key from Groq.

# LLMs

llm = LLM(
    # model="groq/qwen-qwq-32b",
    # api_key= os.environ.get("GROQ_API_KEY"),
    model= "gpt-4o",
    api_key= os.environ.get("OPENAI_API_KEY"),
    temperature=0.4
)

Now we have to create a Crew Base, showing where CrewAI can find the agents and tasks configuration files.

# Creating the crew: base shows where the agents and tasks are defined

@CrewBase
class BlogWriter():
    """Crew to write a blog post"""
    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

Agents Functions

And we are ready to create the code for each agent. Each one is a function marked with the @agent decorator to show that it returns an agent. Inside, we use the Agent class and indicate the agent’s name in the config file and the level of verbosity (1 is low, 2 is high; a Boolean such as True or False also works).

Lastly, we specify if the agent uses any tool, and what model it will use.

# Configuring the agents
    @agent
    def writer_style(self) -> Agent:
        return Agent(
                config=self.agents_config['writer_style'],
                verbose=1,
                knowledge_sources=[pdfs]
                )

    @agent
    def planner(self) -> Agent:
        return Agent(
        config=self.agents_config['planner'],
        verbose=True,
        tools=[SerperDevTool()],
        llm=llm
        )

    @agent
    def content_writer(self) -> Agent:
        return Agent(
        config=self.agents_config['content_writer'],
        verbose=1
        )

    @agent
    def editor(self) -> Agent:
        return Agent(
        config=self.agents_config['editor'],
        verbose=1
        )

Tasks Functions

The next step is creating the tasks. Similarly to the agents, we will create a function and decorate it with @task. We use the Task class to inherit CrewAI’s functionality and point each function to the corresponding entry in our tasks.yaml file. If any output file is expected, use the output_file argument.

# Configuring the tasks    

    @task
    def style(self) -> Task:
        return Task(
        config=self.tasks_config['mystyle'],
        )

    @task
    def plan(self) -> Task:
        return Task(
        config=self.tasks_config['plan'],
        )

    @task
    def write(self) -> Task:
        return Task(
        config=self.tasks_config['write'],
        output_file='output/blog_post.md' # This is the file that will contain the final blog post.
        )

    @task
    def edit(self) -> Task:
        return Task(
        config=self.tasks_config['edit']
        )

Crew

To glue everything together, we now create a function and decorate it with the @crew decorator. That function will line up the agents and the tasks in the order to be performed, since the process chosen here is the simplest: sequential. In other words, everything runs in sequence, from start to finish.

    @crew
    def crew(self) -> Crew:
        """Creates the Blog Post crew"""

        return Crew(
            agents=[self.writer_style(), self.planner(), self.content_writer(), self.editor()],
            tasks=[self.style(), self.plan(), self.write(), self.edit()],
            process=Process.sequential,
            verbose=True
        )

Running the Crew

Running the crew is very simple. We create the main.py file and import the BlogWriter crew base class we just created. Then we call crew().kickoff(inputs) to run it, passing a dictionary with the inputs used to generate the blog post.

# Script to run the blog writer project

# Warning control
import warnings
warnings.filterwarnings('ignore')
from crew import BlogWriter


def write_blog_post(topic: str):
    # Instantiate the crew
    my_writer = BlogWriter()
    # Run
    result = (my_writer
              .crew()
              .kickoff(inputs = {
                  'topic': topic
                  })
    )

    return result

if __name__ == "__main__":

    write_blog_post("Price Optimization with Python")

There it is. The result is a nice blog post created by the LLM. See below.

Resulting blog post. GIF by the author.

That is so nice!

Before You Go

Before you go, know that this blog post was 100% created by me. This crew I created was an experiment I wanted to do to learn more about how to create AI agents and make them work together. And, like I said, I love writing, so this is something I would be able to read and assess the quality.

My opinion is that this crew still didn’t do a very good job. They were able to complete the tasks successfully, but they gave me a very shallow post and code. I would not publish this, but at least it could be a start, maybe. 

From here, I encourage you to learn more about CrewAI. I took their free course where João de Moura, the creator of the package, shows us how to create different kinds of crews. It is really interesting.

GitHub Repository

https://github.com/gurezende/Crew_Writer

About Me

If you want to learn more about my work, or follow my blog (it is really me!), here are my contacts and portfolio.

https://gustavorsantos.me

References

[Quickstart CrewAI](https://docs.crewai.com/quickstart)

[CrewAI Documentation](https://docs.crewai.com/introduction)

[GROQ](https://groq.com/)

[OpenAI](https://openai.com)

[CrewAI Free Course](https://learn.crewai.com/)

The post Creating an AI Agent to Write Blog Posts with CrewAI appeared first on Towards Data Science.

]]>
Agentic AI: Single vs Multi-Agent Systems https://towardsdatascience.com/agentic-ai-single-vs-multi-agent-systems/ Wed, 02 Apr 2025 00:15:17 +0000 https://towardsdatascience.com/?p=605376 Demonstrated by building a tech news agent in LangGraph

The post Agentic AI: Single vs Multi-Agent Systems appeared first on Towards Data Science.

]]>
We’ve seen a shift over the last few years from building rigid, programmatic systems to natural language-driven workflows, all made possible by more advanced large language models.

One of the interesting areas within these Agentic AI systems is the difference between building a single-agent versus a multi-agent workflow, or perhaps between working with more flexible versus more controlled systems.

This article will help you understand what agentic AI is, how to build simple workflows with LangGraph, and the differences in results you can achieve with the different architectures. I’ll demonstrate this by building a tech news agent with various data sources.

As for the use case, I’m a bit obsessed with getting automatic news updates, based on my preferences, without me drowning in information overload every day.

Having AI summarize for us instead of scouting info on our own | Image by author

Working with summarizing and gathering research is one of those areas that agentic AI can really shine.

So follow along while I keep trying to make AI do the grunt work for me, and we’ll see how single-agent compares to multi-agent setups.

I always keep my work jargon-free, so if you’re new to agentic AI, this piece should help you understand what it is and how to work with it. If you’re not new to it, you can scroll past some of the sections.

Agentic AI (& LLMs)

Agentic AI is about programming with natural language. Instead of using rigid, explicit code, you’re instructing large language models (LLMs) to route data and perform actions through plain language in automating tasks.

Using natural language in workflows isn’t new; we’ve used NLP for years to extract and process data. What’s new is the amount of freedom we can now give language models, allowing them to handle ambiguity and make decisions dynamically.

Traditional automation from programmatic to NLP to LLMs | Image by author

But just because LLMs can understand nuanced language doesn’t mean they inherently validate facts or maintain data integrity. I see them primarily as a communication layer that sits on top of structured systems and existing data sources.

LLMs are a communication layer, not the system itself | Image by author

I usually explain it like this to non-technical people: they work a bit like we do. If we don’t have access to clean, structured data, we start making things up. Same with LLMs. They generate responses based on patterns, not truth-checking.

So just like us, they do their best with what they’ve got. If we want better output, we need to build systems that give them reliable data to work with. So, with Agentic systems we integrate ways for them to interact with different data sources, tools and systems.

Now, just because we can use these larger models in more places, doesn’t mean we should. LLMs shine when interpreting nuanced natural language, think customer service, research, or human-in-the-loop collaboration.

But for structured tasks — like extracting numbers and sending them somewhere — you need to use traditional approaches. LLMs aren’t inherently better at math than a calculator. So, instead of having an LLM do calculations, you give an LLM access to a calculator.

So whenever you can build parts of a workflow programmatically, that will still be the better option.
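
To make that concrete, here is a toy sketch of the idea: rather than asking the model to do arithmetic in free text, you expose a calculator as a tool it can call. The tool name and implementation are mine for illustration, not taken from any specific workflow.

from langchain_core.tools import tool

@tool
def calculator(expression: str) -> float:
    """Evaluate a simple arithmetic expression such as '12 * 7 + 3'."""
    # eval is fine for a quick demo; use a proper expression parser in production
    return eval(expression, {"__builtins__": {}}, {})

# An agent framework can now route math to this tool instead of guessing:
print(calculator.invoke({"expression": "12 * 7 + 3"}))  # 87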

Nevertheless, LLMs are great at adapting to messy real-world input and interpreting vague instructions so combining the two can be a great way to build systems.

Agentic Frameworks

I know a lot of people jump straight to CrewAI or AutoGen here, but I’d recommend checking out LangGraph, Agno, Mastra, and Smolagents. Based on my research, these frameworks have received some of the strongest feedback so far.

I collect resources in a Github repo here with the most popular frameworks | Image by author

LangGraph is more technical and can be complex, but it’s the preferred choice for many developers. Agno is easier to get started with but less technical. Mastra is a solid option for JavaScript developers, and Smolagents shows a lot of promise as a lightweight alternative.

In this case, I’ve gone with LangGraph — built on top of LangChain — not because it’s my favorite, but because it’s becoming a go-to framework that more devs are adopting.

So, it’s worth being familiar with.

It has a lot of abstractions though, so you may want to rebuild some parts yourself just to be able to control and understand them better.

I will not go into detail on LangGraph here, so I decided to build a quick guide for those who need a refresher.

As for this use case, you’ll be able to run the workflow without coding anything, but if you’re here to learn you may also want to understand how it works.

Choosing an LLM

Now, you might jump into this and wonder why I’m choosing certain LLMs as the base for the agents.

You can’t just pick any model, especially when working within a framework. They need to be compatible. Key things to look for are tool calling support and the ability to generate structured outputs.

I’d recommend checking HuggingFace’s Agent Leaderboard to see which models actually perform well in real-world agentic systems.

For this workflow, you should be fine using models from Anthropic, OpenAI, or Google. If you’re considering another one, just make sure it’s compatible with LangChain.
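
As a quick illustration, instantiating a LangChain-compatible chat model usually takes only a couple of lines. The provider packages are real, but the exact model names below are my assumptions; swap in whichever compatible model you prefer.

from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI

# Both read their API keys from the environment (OPENAI_API_KEY / GOOGLE_API_KEY)
gemini_llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
openai_llm = ChatOpenAI(model="gpt-4o", temperature=0)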

Single vs. Multi-Agent Systems

If you build a system around one LLM and give it a bunch of tools you want it to use, you’re working with a single-agent workflow. It’s fast, and if you’re new to agentic AI, it might seem like the model should just figure things out on its own.

One agent has access to many tools | Image by author

But the thing is these workflows are just another form of system design. Like any software project, you need to plan the process, define the steps, structure the logic, and decide how each part should behave.

Think about how the logic should work for your use case | Image by author

This is where multi-agent workflows come in.

Not all of them are hierarchical or linear though; some are collaborative. Collaborative workflows fall into the more flexible approach, which I find harder to work with, at least with the capabilities that exist today.

However, collaborative workflows do also break apart different functions into their own modules.

Single-agent and collaborative workflows are great to start with when you’re just playing around, but they don’t always give you the precision needed for actual tasks.

For the workflow I will build here, I already know how the APIs should be used — so it’s my job to guide the system to use it the right way.

We’ll go through comparing a single-agent setup with a hierarchical multi-agent system, where a lead agent delegates tasks across a small team so you can see how they behave in practice.

Building a Single Agent Workflow

With a single thread — i.e., one agent — we give an LLM access to several tools. It’s up to the agent to decide which tool to use and when, based on the user’s question.

One LLM/Agent has access to many tools with many options | Image by author

The challenge with a single agent is control.

No matter how detailed the system prompt is, the model may not follow our requests (this can happen in more controlled environments too). If we give it too many tools or options, there’s a good chance it won’t use all of them or even use the right ones.

To illustrate this, we’ll build a tech news agent that has access to several API endpoints with custom data, each offering several parameter options in its tools. It’s up to the agent to decide how many to use and how to set up the final summary.

Remember, I build these workflows using LangGraph. I won’t go into LangGraph in depth here, so if you want to learn the basics to be able to tweak the code, go here.
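
If you just want a feel for the overall shape of a single-agent setup before opening the repository, here is a minimal sketch using LangGraph’s prebuilt ReAct agent. The tool below is a stub I made up for illustration; the actual workflow uses custom news-API tools and its own prompts.

from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

@tool
def get_trending_keywords(category: str) -> str:
    """Return trending tech keywords for a category (stubbed for this sketch)."""
    return f"Trending in {category}: AI, Google, Microsoft, Large Language Models"

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")  # needs GOOGLE_API_KEY
agent = create_react_agent(llm, tools=[get_trending_keywords])

result = agent.invoke(
    {"messages": [("user", "What is trending in tech this week for companies?")]}
)
print(result["messages"][-1].content)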

You can find the single-agent workflow here. To run it, you’ll need LangGraph Studio and the latest version of Docker installed.

Once you’re set up, open the project folder on your computer, add your GOOGLE_API_KEY in a .env file, and save. You can get a key from Google here.

Gemini Flash 2.0 has a generous free tier, so running this shouldn’t cost anything (but you may run into errors if you use it too much).

If you want to switch to another LLM or tools, you can tweak the code directly. But, again, remember the LLM needs to be compatible.

After setup, launch LangGraph Studio and select the correct folder.

This will boot up our workflow so we can test it.

Opening LangGraph Studio | Image by author

If you run into issues booting this up, double-check that you’re using the latest version of Docker.

Once it’s loaded, you can test the workflow by entering a human message and hitting submit.

LangGraph Studio opening the single agent workflow | Image by author

You can see me run the workflow below.

LangGraph Studio running the single agent workflow | Image by author

You can see the final response below.

LangGraph Studio finishing the single agent workflow | Image by author

For this prompt, it decided to check weekly trending keywords filtered by the category ‘companies’ only, then it fetched the sources of those keywords and summarized them for us.

It had some issues in giving us a unified summary, where it simply used the information it got last and failed to use all of the research.

In reality we want it to fetch both trending and top keywords within several categories (not just companies), check sources, track specific keywords, and reason and summarize it all nicely before returning a response.

We can of course keep probing it with follow-up questions, but as you can imagine, for anything more complex it would start taking shortcuts in the workflow.

The key thing is, an agent system isn’t just going to think the way we expect; we have to actually orchestrate it to do what we want.

So a single agent is great for something simple, but it may not think or behave the way we expect.

This is why going for a more complex system where each agent is responsible for one thing can be really useful.

Testing a Multi-Agent Workflow

Building multiagent workflows is a lot more difficult than building a single agent with access to some tools. To do this, you need to carefully think about the architecture beforehand and how data should flow between the agents.

The multi-agent workflow I’ll set up here uses two different teams — a research team and an editing team — with several agents under each.

Every agent has access to a specific set of tools.

The multiagent workflow logic with a hierarchical team | Image by author

We’re introducing some new tools, like a research pad that acts as a shared space — one team writes their findings, the other reads from it. The last LLM will read everything that has been researched and edited to make a summary.

An alternative to using a research pad is to store data in a scratchpad in state, isolating short-term memory for each team or agent. But that also means thinking carefully about what each agent’s memory should include.
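
To make the scratchpad idea a bit more tangible, here is a rough sketch of what a LangGraph state schema with per-team scratchpads could look like. The field names are my assumptions for illustration, not the actual schema used in the repository.

from typing import Annotated, List, TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class NewsState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]  # shared conversation log
    research_scratchpad: str   # only the research team writes here
    editing_scratchpad: str    # only the editing team writes here
    next: str                  # which agent the supervisor routes to next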

I also decided to build out the tools a bit more to provide richer data upfront, so the agents don’t have to fetch sources for each keyword individually. Here I’m using normal programmatic logic because I can.

A key thing to remember: if you can use normal programming logic, do it.

Since we’re using multiple agents, you can lower costs by using cheaper models for most agents and reserving the more expensive ones for the important stuff.

Here, I’m using Gemini Flash 2.0 for all agents except the summarizer, which runs on OpenAI’s GPT-4o. If you want higher-quality summaries, you can use an even more advanced LLM with a larger context window.

The workflow is set up for you here. Before loading it, make sure to add both your OpenAI and Google API keys in a .env file.

In this workflow, the routes (edges) are set up dynamically instead of manually, as we did with the single agent, so it will look more complex if you peek into the code.

Once you boot up the workflow in LangGraph Studio — same process as before — you’ll see the graph with all these nodes ready.

Opening the multiagent workflow in LangGraph Studio | Image by author

LangGraph Studio lets us visualize how the system delegates work between agents when we run it—just like we saw in the simpler workflow above.

Since I understand the tools each agent is using, I can prompt the system in the right way. But regular users won’t know how to do this properly. So if you’re building something similar, I’d suggest introducing an agent that transforms the user’s query into something the other agents can actually work with.

We can test it out by setting a message.

“I’m an investor and I’m interested in getting an update for what has happened within the week in tech, and what people are talking about (this means categories like companies, people, websites and subjects are interesting). Please also track these specific keywords: AI, Google, Microsoft, and Large Language Models”

Then choosing “supervisor” as the Next parameter (we’d normally do this programmatically).

Running the multiagent workflow in LangGraph Studio — it will take several minutes | Image by author

This workflow will take several minutes to run, unlike the single-agent workflow we ran earlier which finished in under a minute.

So be patient while the tools are running.

In general, these systems take time to gather and process information and that’s just something we need to get used to.

The final summary will look something like this:

The result from the multiagent workflow in LangGraph Studio | Image by author

You can read the whole thing here instead if you want to check it out.

The news will obviously vary depending on when you run the workflow. I ran it on the 28th of March, so the example report is for that date.

It should save the summary to a text document, but if you’re running this inside a container, you likely won’t be able to access that file easily. It’s better to send the output somewhere else — like Google Docs or via email.

As for the results, I’ll let you decide for yourself the difference between using a more complex system versus a simple one, and how it gives us more control over the process.

Finishing Notes

I’m working with a good data source here. Without that, you’d need to add a lot more error handling, which would slow everything down even more.

Clean and structured data is key. Without it, the LLM won’t perform at its best.

Even with solid data, it’s not perfect. You still need to work on the agents to make sure they do what they’re supposed to.

You’ve probably already noticed the system works — but it’s not quite there yet.

There are still several things that need improvement: parsing the user’s query into a more structured format, adding guardrails so agents always use their tools, summarizing more effectively to keep the research doc concise, improving error handling, and introducing long-term memory to better understand what the user actually needs.

State (short-term memory) is especially important if you want to optimize for performance and cost.

Right now, we’re just pushing every message into state and giving all agents access to it, which isn’t ideal. We really want to separate state between the teams. In this case, it’s something I haven’t done, but you can try it by introducing a scratchpad in the state schema to isolate what each team knows.

Regardless, I hope it was a fun experience to understand the results we can get by building different Agentic Workflows.

If you want to see more of what I’m working on, you can follow me here, as well as on Medium, GitHub, or LinkedIn (though I’m hoping to move over to X soon). I also have a Substack, where I hope to publish shorter pieces.

❤

The post Agentic AI: Single vs Multi-Agent Systems appeared first on Towards Data Science.

]]>
AI Agents from Scratch: Multi-Agent System https://towardsdatascience.com/ai-agents-from-zero-to-hero-part-3/ Fri, 28 Mar 2025 20:15:13 +0000 https://towardsdatascience.com/?p=605329 From Zero to Hero using only Python & Ollama (no GPU, no APIKEY)

The post AI Agents from Scratch: Multi-Agent System appeared first on Towards Data Science.

]]>
Intro

In Part 1 of this tutorial series, we introduced AI Agents, autonomous programs that perform tasks, make decisions, and communicate with others. 

In Part 2 of this tutorial series, we understood how to make the Agent try and retry until the task is completed through Iterations and Chains

A single Agent can usually operate effectively using a tool, but it can be less effective when using many tools simultaneously. One way to tackle complicated tasks is through a “divide-and-conquer” approach: create a specialized Agent for each task and have them work together as a Multi-Agent System (MAS)

In a MAS, multiple agents collaborate to achieve common goals, often tackling challenges that are too difficult for a single Agent to handle alone. There are two main ways they can interact: 

  • Sequential flow – The Agents do their work in a specific order, one after the other. For example, Agent 1 finishes its task, and then Agent 2 uses the result to do its task. This is useful when tasks depend on each other and must be done step-by-step.
  • Hierarchical flow – Usually, one higher-level Agent manages the whole process and gives instructions to lower-level Agents which focus on specific tasks. This is useful when the final output requires some back-and-forth.

In this tutorial, I’m going to show how to build from scratch different types of Multi-Agent Systems, from simple to more advanced. I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example (link to full code at the end of the article).

Setup

Please refer to Part 1 for the setup of Ollama and the main LLM.

import ollama
llm = "qwen2.5" 

In this example, I will ask the model to process images, therefore I’m also going to need a Vision LLM. It is a specialized version of a Large Language Model that, integrating NLP with CV, is designed to understand visual inputs, such as images and videos, in addition to text.

Microsoft’s LLaVa is an efficient choice as it can also run without a GPU.

You can download it with ollama pull llava. After the download is completed, you can move on to Python and start writing code. Let’s load an image so that we can try out the Vision LLM.

from matplotlib import image as pltimg, pyplot as plt

image_file = "draghi.jpeg"

plt.imshow(pltimg.imread(image_file))
plt.show()

In order to test the Vision LLM, you can just pass the image as an input:

import ollama

ollama.generate(model="llava",
                prompt="describe the image",
                images=[image_file])["response"]

Sequential Multi-Agent System

I shall build two Agents that will work in a sequential flow, one after the other, where the second takes the output of the first as an input, just like a Chain.

  • The first Agent must process an image provided by the user and return a verbal description of what it sees.
  • The second Agent will search the internet and try to understand where and when the picture was taken, based on the description provided by the first Agent.

Both Agents shall use one Tool each. The first Agent will have the Vision LLM as a Tool. Please remember that with Ollama, in order to use a Tool, the function must be described in a dictionary.

def process_image(path: str) -> str:
    return ollama.generate(model="llava", prompt="describe the image", images=[path])["response"]

tool_process_image = {'type':'function', 'function':{
  'name': 'process_image',
  'description': 'Load an image for a given path and describe what you see',
  'parameters': {'type': 'object',
                'required': ['path'],
                'properties': {
                    'path': {'type':'str', 'description':'the path of the image'},
}}}}

The second Agent should have a web-searching Tool. In the previous articles of this tutorial series, I showed how to leverage the DuckDuckGo package for searching the web. So, this time, we can use a new Tool: Wikipedia (pip install wikipedia==1.4.0). You can directly use the original library or import the LangChain wrapper.

from langchain_community.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper

def search_wikipedia(query:str) -> str:
    return WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper()).run(query)

tool_search_wikipedia = {'type':'function', 'function':{
  'name': 'search_wikipedia',
  'description': 'Search on Wikipedia by passing some keywords',
  'parameters': {'type': 'object',
                'required': ['query'],
                'properties': {
                    'query': {'type':'str', 'description':'The input must be short keywords, not a long text'},
}}}}
## test
search_wikipedia(query="draghi")

First, you need to write a prompt to describe the task of each Agent (the more detailed, the better), and that will be the first message in the chat history with the LLM.

prompt = '''
You are a photographer that analyzes and describes images in details.
'''
messages_1 = [{"role":"system", "content":prompt}]

One important decision to make when building a MAS is whether the Agents should share the chat history or not. The management of chat history depends on the design and objectives of the system:

  • Shared chat history – Agents have access to a common conversation log, allowing them to see what other Agents have said or done in previous interactions. This can enhance the collaboration and the understanding of the overall context.
  • Separate chat history – Agents only have access to their own interactions, focusing only on their own communication. This design is typically used when independent decision-making is important.

I recommend keeping the chats separate unless it is necessary to do otherwise. LLMs might have a limited context window, so it’s better to keep the history as light as possible.

prompt = '''
You are a detective. You read the image description provided by the photographer, and you search Wikipedia to understand when and where the picture was taken.
'''

messages_2 = [{"role":"system", "content":prompt}]

For convenience, I shall use the function defined in the previous articles to process the model’s response.

def use_tool(agent_res:dict, dic_tools:dict) -> dict:
    ## use tool
    if "tool_calls" in agent_res["message"].keys():
        for tool in agent_res["message"]["tool_calls"]:
            t_name, t_inputs = tool["function"]["name"], tool["function"]["arguments"]
            if f := dic_tools.get(t_name):
                ### calling tool
                print('🔧 >', f"\x1b[1;31m{t_name} -> Inputs: {t_inputs}\x1b[0m")
                ### tool output
                t_output = f(**tool["function"]["arguments"])
                print(t_output)
                ### final res
                res = t_output
            else:
                print('🤬 >', f"\x1b[1;31m{t_name} -> NotFound\x1b[0m")
    ## don't use tool
    if agent_res['message']['content'] != '':
        res = agent_res["message"]["content"]
        t_name, t_inputs = '', ''
    return {'res':res, 'tool_used':t_name, 'inputs_used':t_inputs}

As we already did in previous tutorials, the interaction with the Agents can be started with a while loop. The user is requested to provide an image that the first Agent will process.

dic_tools = {'process_image':process_image, 
 'search_wikipedia':search_wikipedia}

while True:
    ## user input
    try:
        q = input('📷 > give me the image to analyze:')
    except EOFError:
        break
    if q == "quit":
        break
    if q.strip() == "":
        continue
    messages_1.append( {"role":"user", "content":q} )

    plt.imshow(pltimg.imread(q))
    plt.show()

    ## Agent 1
    agent_res = ollama.chat(model=llm,
                            tools=[tool_process_image],
                            messages=messages_1)
    dic_res = use_tool(agent_res, dic_tools)
    res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]

    print("👽📷 >", f"\x1b[1;30m{res}\x1b[0m")
    messages_1.append( {"role":"assistant", "content":res} )

The first Agent used the Vision LLM Tool and recognized text within the image. Now, the description will be passed to the second Agent, which shall extract some keywords to search Wikipedia.

    ## Agent 2
    messages_2.append( {"role":"system", "content":"-Picture: "+res} )

    agent_res = ollama.chat(model=llm,
                            tools=[tool_search_wikipedia],
                            messages=messages_2)
    dic_res = use_tool(agent_res, dic_tools)
    res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]

The second Agent used the Tool and extracted information from the web, based on the description provided by the first Agent. Now, it can process everything and give a final answer.

    if tool_used == "search_wikipedia":
        messages_2.append( {"role":"system", "content":"-Wikipedia: "+res} )
        agent_res = ollama.chat(model=llm, tools=[], messages=messages_2)
        dic_res = use_tool(agent_res, dic_tools)
        res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
    else:
        messages_2.append( {"role":"assistant", "content":res} )
   
    print("👽📖 >", f"\x1b[1;30m{res}\x1b[0m")

This is literally perfect! Let’s move on to the next example.

Hierarchical Multi-Agent System

Imagine having a squad of Agents that operates with a hierarchical flow, just like a human team, with distinct roles to ensure smooth collaboration and efficient problem-solving. At the top, a manager oversees the overall strategy, talking to the customer (the user), making high-level decisions, and guiding the team toward the goal. Meanwhile, other team members handle operative tasks. Just like humans, Agents can work together and delegate tasks appropriately.

I shall build a tech team of 3 Agents with the objective of querying a SQL database per user’s request. They must work in a hierarchical flow:

  • The Lead Agent talks to the user and understands the request. Then, it decides which team member is the most appropriate for the task.
  • The Junior Agent has the job of exploring the db and building SQL queries.
  • The Senior Agent shall review the SQL code, correct it if necessary, and execute it.

LLMs know how to code by being exposed to a large corpus of both code and natural language text, where they learn patterns, syntax, and semantics of programming languages. The model learns the relationships between different parts of the code by predicting the next token in a sequence. In short, LLMs can generate SQL code but can’t execute it; Agents can.

First of all, I am going to create a database and connect to it, then I shall prepare a series of Tools to execute SQL code.

## Read dataset
import pandas as pd

dtf = pd.read_csv('http://bit.ly/kaggletrain')
dtf.head(3)

## Create db
import sqlite3

dtf.to_sql(index=False, name="titanic",
           con=sqlite3.connect("database.db"),
           if_exists="replace")

## Connect db
from langchain_community.utilities.sql_database import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///database.db")

Let’s start with the Junior Agent. LLMs don’t need Tools to generate SQL code, but the Agent doesn’t know the table names and structure. Therefore, we need to provide Tools to investigate the database.

from langchain_community.tools.sql_database.tool import ListSQLDatabaseTool

def get_tables() -> str:
    return ListSQLDatabaseTool(db=db).invoke("")

tool_get_tables = {'type':'function', 'function':{
  'name': 'get_tables',
  'description': 'Returns the name of the tables in the database.',
  'parameters': {'type': 'object',
                'required': [],
                'properties': {}
}}}

## test
get_tables()

That will show the available tables in the db, while the next tool will print the columns of a given table.

from langchain_community.tools.sql_database.tool import InfoSQLDatabaseTool

def get_schema(tables: str) -> str:
    tool = InfoSQLDatabaseTool(db=db)
    return tool.invoke(tables)

tool_get_schema = {'type':'function', 'function':{
  'name': 'get_schema',
  'description': 'Returns the name of the columns in the table.',
  'parameters': {'type': 'object',
                'required': ['tables'],
                'properties': {'tables': {'type':'str', 'description':'table name. Example Input: table1, table2, table3'}}
}}}

## test
get_schema(tables='titanic')

Since this Agent must use more than one Tool which might fail, I’ll write a solid prompt, following the structure of the previous article.

prompt_junior = '''
[GOAL] You are a data engineer who builds efficient SQL queries to get data from the database.

[RETURN] You must return a final SQL query based on user's instructions.

[WARNINGS] Use your tools only once.

[CONTEXT] In order to generate the perfect SQL query, you need to know the name of the table and the schema.
First ALWAYS use the tool 'get_tables' to find the name of the table.
Then, you MUST use the tool 'get_schema' to get the columns in the table.
Finally, based on the information you got, generate an SQL query to answer user question.
'''

Moving to the Senior Agent. Code checking doesn’t require any particular trick, you can just use the LLM.

def sql_check(sql: str) -> str:
    p = f'''Double check if the SQL query is correct: {sql}. You MUST return just the SQL code without comments'''
    res = ollama.generate(model=llm, prompt=p)["response"]
    return res.replace('sql','').replace('```','').replace('\n',' ').strip()

tool_sql_check = {'type':'function', 'function':{
  'name': 'sql_check',
  'description': 'Before executing a query, always review the SQL query and correct the code if necessary',
  'parameters': {'type': 'object',
                'required': ['sql'],
                'properties': {'sql': {'type':'str', 'description':'SQL code'}}
}}}

## test
sql_check(sql='SELECT * FROM titanic TOP 3')

Executing code on the database is a different story: LLMs can’t do that alone.

from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool

def sql_exec(sql: str) -> str:
    return QuerySQLDataBaseTool(db=db).invoke(sql)
   
tool_sql_exec = {'type':'function', 'function':{
  'name': 'sql_exec',
  'description': 'Execute a SQL query',
  'parameters': {'type': 'object',
                'required': ['sql'],
                'properties': {'sql': {'type':'str', 'description':'SQL code'}}
}}}

## test
sql_exec(sql='SELECT * FROM titanic LIMIT 3')

And of course, a good prompt.

prompt_senior = '''
[GOAL] You are a senior data engineer who reviews and executes the SQL queries written by others.

[RETURN] You must return data from the database.

[WARNINGS] Use your tools only once.

[CONTEXT] ALWAYS check the SQL code before executing it on the database.
First ALWAYS use the tool 'sql_check' to review the query. The output of this tool is the correct SQL query.
You MUST use ONLY the correct SQL query when you use the tool 'sql_exec'.
'''

Finally, we shall create the Lead Agent. It has the most important job: invoking other Agents and telling them what to do. There are many ways to achieve that, but I find creating a simple Tool the most accurate one.

def invoke_agent(agent:str, instructions:str) -> str:
    return agent+" - "+instructions if agent in ['junior','senior'] else f"Agent '{agent}' Not Found"
   
tool_invoke_agent = {'type':'function', 'function':{
  'name': 'invoke_agent',
  'description': 'Invoke another Agent to work for you.',
  'parameters': {'type': 'object',
                'required': ['agent', 'instructions'],
                'properties': {
                    'agent': {'type':'str', 'description':'the Agent name, one of "junior" or "senior".'},
                    'instructions': {'type':'str', 'description':'detailed instructions for the Agent.'}
                }
}}}

## test
invoke_agent(agent="intern", instructions="build a query")

Describe in the prompt what kind of behavior you’re expecting. Try to be as detailed as possible, for hierarchical Multi-Agent Systems can get very confusing.

prompt_lead = '''
[GOAL] You are a tech lead.
You have a team with one junior data engineer called 'junior', and one senior data engineer called 'senior'.

[RETURN] You must return data from the database based on user's requests.

[WARNINGS] You are the only one that talks to the user and gets the requests from the user.
The 'junior' data engineer only builds queries.
The 'senior' data engineer checks the queries and execute them.

[CONTEXT] First ALWAYS ask the users what they want.
Then, you MUST use the tool 'invoke_agent' to pass the instructions to the 'junior' for building the query.
Finally, you MUST use the tool 'invoke_agent' to pass the instructions to the 'senior' for retrieving the data from the database.
'''

I shall keep chat history separate so each Agent will know only a specific part of the whole process.

dic_tools = {'get_tables':get_tables,
            'get_schema':get_schema,
            'sql_exec':sql_exec,
            'sql_check':sql_check,
            'invoke_agent':invoke_agent}

messages_junior = [{"role":"system", "content":prompt_junior}]
messages_senior = [{"role":"system", "content":prompt_senior}]
messages_lead   = [{"role":"system", "content":prompt_lead}]

Everything is ready to start the workflow. After the user begins the chat, the first to respond is the Leader, which is the only one that directly interacts with the human.

while True:
    ## user input
    q = input('🙂 >')
    if q == "quit":
        break
    messages_lead.append( {"role":"user", "content":q} )
   
    ## Lead Agent
    agent_res = ollama.chat(model=llm, messages=messages_lead, tools=[tool_invoke_agent])
    dic_res = use_tool(agent_res, dic_tools)
    res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
    agent_invoked = res.split("-")[0].strip() if len(res.split("-")) > 1 else ''
    instructions = res.split("-")[1].strip() if len(res.split("-")) > 1 else ''

    ###-->CODE TO INVOKE OTHER AGENTS HERE<--###

    ## Lead Agent final response
    print("👩‍💼 >", f"\x1b[1;30m{res}\x1b[0m")
    messages_lead.append( {"role":"assistant", "content":res} )

The Lead Agent decided to invoke the Junior Agent giving it some instruction, based on the interaction with the user. Now the Junior Agent shall start working on the query.

    ## Invoke Junior Agent
    if agent_invoked == "junior":
        print("😎 >", f"\x1b[1;32mReceived instructions: {instructions}\x1b[0m")
        messages_junior.append( {"role":"user", "content":instructions} )
       
        ### use the tools
        available_tools = {"get_tables":tool_get_tables, "get_schema":tool_get_schema}
        context = ''
        while available_tools:
            agent_res = ollama.chat(model=llm, messages=messages_junior,
                                    tools=[v for v in available_tools.values()])
            dic_res = use_tool(agent_res, dic_tools)
            res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
            if tool_used:
                available_tools.pop(tool_used)
            context = context + f"\nTool used: {tool_used}. Output: {res}" #->add tool usage context
            messages_junior.append( {"role":"user", "content":context} )
           
        ### response
        agent_res = ollama.chat(model=llm, messages=messages_junior)
        dic_res = use_tool(agent_res, dic_tools)
        res = dic_res["res"]
        print("😎 >", f"\x1b[1;32m{res}\x1b[0m")
        messages_junior.append( {"role":"assistant", "content":res} )

The Junior Agent activated all its Tools to explore the database and collected the necessary information to generate some SQL code. Now, it must report back to the Lead.

        ## update Lead Agent
        context = "Junior already wrote this query: "+res+ "\nNow invoke the Senior to review and execute the code."
        print("👩‍💼 >", f"\x1b[1;30m{context}\x1b[0m")
        messages_lead.append( {"role":"user", "content":context} )
        agent_res = ollama.chat(model=llm, messages=messages_lead, tools=[tool_invoke_agent])
        dic_res = use_tool(agent_res, dic_tools)
        res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
        agent_invoked = res.split("-")[0].strip() if len(res.split("-")) > 1 else ''
        instructions = res.split("-")[1].strip() if len(res.split("-")) > 1 else ''

The Lead Agent received the output from the Junior and asked the Senior Agent to review and execute the SQL query.

    ## Invoke Senior Agent
    if agent_invoked == "senior":
        print("🧓 >", f"\x1b[1;34mReceived instructions: {instructions}\x1b[0m")
        messages_senior.append( {"role":"user", "content":instructions} )
       
        ### use the tools
        available_tools = {"sql_check":tool_sql_check, "sql_exec":tool_sql_exec}
        context = ''
        while available_tools:
            agent_res = ollama.chat(model=llm, messages=messages_senior,
                                    tools=[v for v in available_tools.values()])
            dic_res = use_tool(agent_res, dic_tools)
            res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
            if tool_used:
                available_tools.pop(tool_used)
            context = context + f"\nTool used: {tool_used}. Output: {res}" #->add tool usage context
            messages_senior.append( {"role":"user", "content":context} )
           
        ### response
        print("🧓 >", f"\x1b[1;34m{res}\x1b[0m")
        messages_senior.append( {"role":"assistant", "content":res} )

The Senior Agent executed the query on the db and got an answer. Finally, it can report back to the Lead which will give the final answer to the user.

        ### update Lead Agent
        context = "Senior agent returned this output: "+res
        print("👩‍💼 >", f"\x1b[1;30m{context}\x1b[0m")
        messages_lead.append( {"role":"user", "content":context} )

Conclusion

This article has covered the basic steps of creating Multi-Agent Systems from scratch using only Ollama. With these building blocks in place, you are already equipped to start developing your own MAS for different use cases. 

Stay tuned for Part 4, where we will dive deeper into more advanced examples.

Full code for this article: GitHub

I hope you enjoyed it! Feel free to contact me for questions and feedback or just to share your interesting projects.

👉 Let’s Connect 👈

All images, unless otherwise noted, are by the author

The post AI Agents from Scratch: Multi-Agent System appeared first on Towards Data Science.

]]>
AI Agents from Scratch: Iterations & Chains https://towardsdatascience.com/ai-agents-from-zero-to-hero-part-2/ Wed, 26 Mar 2025 19:36:29 +0000 https://towardsdatascience.com/?p=605307 From Zero to Hero using only Python & Ollama (no GPU, no APIKEY)

The post AI Agents from Scratch: Iterations & Chains appeared first on Towards Data Science.

]]>
Intro

In Part 1 of this tutorial series, we introduced AI Agents, autonomous programs that perform tasks, make decisions, and communicate with others. 

Agents perform actions through Tools. It might happen that a Tool doesn’t work on the first try, or that multiple Tools must be activated in sequence. Agents should be able to organize tasks into a logical progression and change their strategies in a dynamic environment.

To put it simply, the Agent’s structure must be solid, and the behavior must be reliable. The most common way to do that is through:

  • Iterations – repeating a certain action multiple times, often with slight changes or improvements in each cycle. Every time might involve the Agent revisiting certain steps to refine its output or reach an optimal solution.
  • Chains – a series of actions that are linked together in a sequence. Each step in the chain is dependent on the previous one, and the output of one action becomes the input for the next.

In this tutorial, I’m going to show how to use iterations and chains for Agents. I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example (link to full code at the end of the article).

Setup

Please refer to Part 1 for the setup of Ollama and the main LLM.

import ollama
llm = "qwen2.5" 

We will use the YahooFinance public APIs with the Python library (pip install yfinance==0.2.55) to download financial data. 

import yfinance as yf

stock = "MSFT"
yf.Ticker(ticker=stock).history(period='5d') #1d,5d,1mo,3mo,6mo,1y,2y,5y,10y,ytd,max

Let’s embed that into a Tool.

import matplotlib.pyplot as plt

def get_stock(ticker:str, period:str, col:str):
    data = yf.Ticker(ticker=ticker).history(period=period)
    if len(data) > 0:
        data[col].plot(color="black", legend=True, xlabel='', title=f"{ticker.upper()} ({period})").grid()
        plt.show()
        return 'ok'
    else:
        return 'no'

tool_get_stock = {'type':'function', 'function':{
  'name': 'get_stock',
  'description': 'Download stock data',
  'parameters': {'type': 'object',
                'required': ['ticker','period','col'],
                'properties': {
                    'ticker': {'type':'str', 'description':'the ticker symbol of the stock.'},
                    'period': {'type':'str', 'description':"for 1 month input '1mo', for 6 months input '6mo', for 1 year input '1y'. Use '1y' if not specified."},
                    'col': {'type':'str', 'description':"one of 'Open','High','Low','Close','Volume'. Use 'Close' if not specified."},
}}}}

## test
get_stock(ticker="msft", period="1y", col="Close")

Moreover, taking the code from the previous article as a reference, I shall write a general function to process the model response, such as when the Agent wants to use a Tool or when it just returns text.

def use_tool(agent_res:dict, dic_tools:dict) -> dict:
    ## use tool
    if "tool_calls" in agent_res["message"].keys():
        for tool in agent_res["message"]["tool_calls"]:
            t_name, t_inputs = tool["function"]["name"], tool["function"]["arguments"]
            if f := dic_tools.get(t_name):
                ### calling tool
                print('🔧 >', f"\x1b[1;31m{t_name} -> Inputs: {t_inputs}\x1b[0m")
                ### tool output
                t_output = f(**tool["function"]["arguments"])
                print(t_output)
                ### final res
                res = t_output
            else:
                print('🤬 >', f"\x1b[1;31m{t_name} -> NotFound\x1b[0m")
    ## don't use tool
    if agent_res['message']['content'] != '':
        res = agent_res["message"]["content"]
        t_name, t_inputs = '', ''
    return {'res':res, 'tool_used':t_name, 'inputs_used':t_inputs}

Let’s start a quick conversation with our Agent. For now, I’m going to use a simple generic prompt.

prompt = '''You are a financial analyst, assist the user using your available tools.'''
messages = [{"role":"system", "content":prompt}]
dic_tools = {'get_stock':get_stock}

while True:
    ## user input
    try:
        q = input('🙂 >')
    except EOFError:
        break
    if q == "quit":
        break
    if q.strip() == "":
        continue
    messages.append( {"role":"user", "content":q} )
   
    ## model
    agent_res = ollama.chat(model=llm, messages=messages,
                            tools=[tool_get_stock])
    dic_res = use_tool(agent_res, dic_tools)
    res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
   
    ## final response
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )

As you can see, I started by asking an “easy” question. The LLM already knows that the symbol of Microsoft stock is MSFT, therefore the Agent was able to activate the Tool with the right inputs. But what if I ask something that might not be included in the LLM knowledge base? 

It seems that the LLM doesn’t know that Facebook changed its name to META, so it used the Tool with the wrong inputs. I will enable the Agent to try an action several times through iterations.

Iterations

Iterations refer to the repetition of a process until a certain condition is met. We can let the Agent try a specific number of times, but we need to let it know that the previous parameters didn’t work, by adding the details in the message history.

    max_i, i = 3, 0
    while res == 'no' and i < max_i:
        comment = f'''I used tool '{tool_used}' with inputs {inputs_used}. But it didn't work, so I must try again with different inputs'''
        messages.append( {"role":"assistant", "content":comment} )
        agent_res = ollama.chat(model=llm, messages=messages,
                                tools=[tool_get_stock])
        dic_res = use_tool(agent_res, dic_tools)
        res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
       
        i += 1
        if i == max_i:
            res = f'I tried {i} times but something is wrong'

    ## final response
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )

The Agent tried 3 times with different inputs but it couldn’t find a solution because there is a gap in the LLM knowledge base. In this case, the model needed human input to understand how to use the Tool.

Next, we’re going to enable the Agent to fill the knowledge gap by itself.

Chains

A chain refers to a linear sequence of actions where the output of one step is used as the input for the next step. In this example, I will add another Tool that the Agent can use in case the first one fails.

We can use the web-searching Tool from the previous article.

from langchain_community.tools import DuckDuckGoSearchResults

def search_web(query:str) -> str:
  return DuckDuckGoSearchResults(backend="news").run(query)

tool_search_web = {'type':'function', 'function':{
  'name': 'search_web',
  'description': 'Search the web',
  'parameters': {'type': 'object',
                'required': ['query'],
                'properties': {
                    'query': {'type':'str', 'description':'the topic or subject to search on the web'},
}}}}

## test
search_web(query="facebook stock")

So far, I’ve always used very generic prompts as the tasks were relatively simple. Now, I want to make sure that the Agent understands how to use the Tools in the right order, so I’m going to write a proper prompt. This is how a prompt should be done:

  1. The goal of the Agent
  2. What it must return (i.e. format, content)
  3. Any relevant warnings that might affect the output
  4. Context dump

prompt = '''
[GOAL] You are a financial analyst, assist the user using your available tools.

[RETURN] You must return the stock data that the user asks for.

[WARNINGS] In order to retrieve stock data, you need to know the ticker symbol of the company.

[CONTEXT] First ALWAYS try to use the tool 'get_stock'.
If it doesn't work, you can use the tool 'search_web' and search 'company name stock'.
Get information about the stock and deduct what is the right ticker symbol of the company.
Then, you can use AGAIN the tool 'get_stock' with the ticker you got using the previous tool.
'''

We can simply add the chain to the iteration loop that we already have. This time the Agent has two Tools, and when the first fails, the model can decide whether to retry or to use the second one. Then, if the second Tool is used, the Agent must process the output and learn what’s the right input for the first Tool that initially failed.

    max_i, i = 3, 0
    while res in ['no',''] and i < max_i:
        comment = f'''I used tool '{tool_used}' with inputs {inputs_used}. But it didn't work, so I must try a different way.'''
        messages.append( {"role":"assistant", "content":comment} )
        agent_res = ollama.chat(model=llm, messages=messages,
                                tools=[tool_get_stock, tool_search_web])
        dic_res = use_tool(agent_res, dic_tools)
        res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]

        ## chain: output of previous tool = input of next tool
        if tool_used == 'search_web':
            query = q+". You must return just the company ticker.\nContext: "+res
            llm_res = ollama.generate(model=llm, prompt=query)["response"]
            messages.append( {"role":"user", "content":f"try ticker: {llm_res}"} )
           
            print("👽 >", f"\x1b[1;30mI can try with {llm_res}\x1b[0m")
           
            agent_res = ollama.chat(model=llm, messages=messages, tools=[tool_get_stock])
            dic_res = use_tool(agent_res, dic_tools)
            res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
        i += 1
        if i == max_i:
            res = f'I tried {i} times but something is wrong'
    
    ## final response
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )

As expected, the Agent tried to use the first Tool with the wrong inputs, but instead of trying the same action again as before, it decided to use the second Tool. By consuming information, it should understand the solution without the need for human input.

In summary, the AI tried to do an action but failed due to a gap in its knowledge base. So it activated Tools to fill that gap and deliver the output requested by the user… that is indeed the true essence of AI Agents. 

Conclusion

This article has covered more structured ways to make Agents more reliable, using iterations and chains. With these building blocks in place, you are already equipped to start developing your own Agents for different use cases. 

Stay tuned for Part 3, where we will dive deeper into more advanced examples.

Full code for this article: GitHub

I hope you enjoyed it! Feel free to contact me for questions and feedback or just to share your interesting projects.

👉 Let’s Connect 👈

The post AI Agents from Scratch: Iterations & Chains appeared first on Towards Data Science.

]]>
A Clear Intro to MCP (Model Context Protocol) with Code Examples https://towardsdatascience.com/clear-intro-to-mcp/ Tue, 25 Mar 2025 18:03:19 +0000 https://towardsdatascience.com/?p=605246 MCP is a way to democratize access to tools for AI Agents. In this article we cover the fundamental components of MCP, how they work together, and a code example of how MCP works in practice.

The post A Clear Intro to MCP (Model Context Protocol) with Code Examples appeared first on Towards Data Science.

]]>
As the race to move AI agents from prototype to production heats up, the need for a standardized way for agents to call tools across different providers is pressing. This transition to a standardized approach to agent tool calling is similar to what we saw with REST APIs. Before they existed, developers had to deal with a mess of proprietary protocols just to pull data from different services. REST brought order to chaos, enabling systems to talk to each other in a consistent way. MCP (Model Context Protocol) is aiming to, as it sounds, provide context for AI models in a standard way. Without it, we’re headed towards tool-calling mayhem where multiple incompatible versions of “standardized” tool calls crop up simply because there’s no shared way for agents to organize, share, and invoke tools. MCP gives us a shared language and the democratization of tool calling.

One thing I’m personally excited about is how tool-calling standards like MCP can actually make AI systems safer. With easier access to well-tested tools, more companies can avoid reinventing the wheel, which reduces security risks and minimizes the chance of malicious code. As AI systems start scaling in 2025, these are valid concerns.

As I dove into MCP, I realized there was a huge gap in the documentation. There’s plenty of high-level “what does it do” content, but when you actually want to understand how it works, the resources fall short—especially for those who aren’t native developers. The material is either a high-level explainer or buried deep in the source code.

In this piece, I’m going to break MCP down for a broader audience—making the concepts and functionality clear and digestible. If you’re able, follow along in the coding section; if not, everything is explained in natural language above the code snippets.

An Analogy to Understand MCP: The Restaurant

Let’s imagine the concept of MCP as a restaurant where we have:

The Host = The restaurant building (the environment where the agent runs)

The Server = The kitchen (where tools live)

The Client = The waiter (who sends tool requests)

The Agent = The customer (who decides what tool to use)

The Tools = The recipes (the code that gets executed)

The Components of MCP

Host

This is where the agent operates. In our analogy, it’s the restaurant building; in MCP, it’s wherever your agents or LLMs actually run. If you’re using Ollama locally, you’re the host. If you’re using Claude or GPT, then Anthropic or OpenAI are the hosts.

Client

This is the environment that sends tool call requests from the agent. Think of it as the waiter who takes your order and delivers it to the kitchen. In practical terms, it’s the application or interface where your agent runs. The client passes tool call requests to the Server using MCP.

Server

This is the kitchen where recipes, or tools, are housed. It centralizes tools so agents can access them easily. Servers can be local (spun up by users) or remote (hosted by companies offering tools). Tools on a server are typically either grouped by function or integration. For instance, all Slack-related tools can be on a “Slack server,” or all messaging tools can be grouped together on a “messaging server”. That decision is based on architectural and developer preferences.

Agent

The “brains” of the operation. Powered by an LLM, it decides which tools to call to complete a task. When it determines a tool is needed, it initiates a request to the server. The agent doesn’t need to natively understand MCP because it learns how to use it through the metadata associated with each of the tools. This metadata tells the agent the protocol for calling the tool and the execution method. But it is important to note that the platform or agent needs to support MCP so that it handles tool calls automatically. Otherwise, it is up to the developer to write the complex translation logic of how to parse the metadata from the schema, form tool call requests in MCP format, map the requests to the correct function, execute the code, and return the result in MCP-compliant format back to the agent.

Tools

These are the functions, such as calling APIs or custom code, that “do the work”. Tools live on servers and can be:

  • Custom tools you create and host on a local server.
  • Premade tools hosted by others on a remote server.
  • Premade code created by others but hosted by you on a local server.

How the components fit together

  1. Server Registers Tools
    Each tool is defined with a name, description, input/output schemas, and a function handler (the code that runs), and is registered to the server. This usually involves calling a method or API to tell the server “hey, here’s a new tool and this is how you use it” (a minimal sketch follows below).
  2. Server Exposes Metadata
    When the server starts or an agent connects, it exposes the tool metadata (schemas, descriptions) via MCP.
  3. Agent Discovers Tools
    The agent queries the server (using MCP) to see what tools are available. It understands how to use each tool from the tool metadata. This typically happens on startup or when tools are added.
  4. Agent Plans Tool Use
    When the agent determines a tool is needed (based on user input or task context), it forms a tool call request in a standardized MCP JSON format which includes tool name, input parameters that match the tool’s input schema, and any other metadata. The client acts as the transport layer and sends the MCP formatted request to the server over HTTP.
  5. Translation Layer Executes
    The translation layer takes the agent’s standardized tool call (via MCP), maps the request to the corresponding function on the server, executes the function, formats the result back to MCP, and sends it back to the agent. A framework that abstracts MCP for you does all of this without the developer needing to write the translation layer logic (which sounds like a headache).
Image by Sandi Besen
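
To make steps 1 and 2 concrete, below is a minimal sketch of registering a tool on a local MCP server using the FastMCP helper from the official MCP Python SDK. The server name "demo" and the add_numbers tool are illustrative assumptions (not part of the Brave example used later), and the exact API may differ slightly between SDK versions.

# Minimal sketch of steps 1-2: the server registers a tool and exposes its metadata.
# Assumes the FastMCP helper from the official MCP Python SDK; all names here are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo")  # the server that will hold our tools

@mcp.tool()  # registers the function; its schema is derived from the type hints and docstring
def add_numbers(a: int, b: int) -> int:
    """Add two integers and return the result."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # start the server so connected clients can list and call the tool over MCP

Once the server is running, a client covers step 3 with a "tools/list" request and step 4 with a "tools/call" request carrying roughly {"name": "add_numbers", "arguments": {"a": 1, "b": 2}}; the exact JSON-RPC fields are defined by the MCP specification.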

Code Example of A Re-Act Agent Using MCP Brave Search Server

In order to understand what MCP looks like when applied, let’s use the BeeAI framework from IBM, which natively supports MCP and handles the translation logic for us.

 If you plan on running this code you will need to:

  1. Clone the beeai framework repo to gain access to the helper classes used in this code 
  2. Create a free Brave developer account and get your API key. There are free subscriptions available (credit card required). 
  3. Create an OpenAI developer account and create an API Key
  4. Add your Brave API key and OpenAI key to the .env file at the python folder level of the repo.
  5. Ensure you have npm installed and have set your path correctly.

Sample .env file

BRAVE_API_KEY= "<Your API Key Here>"
BEEAI_LOG_LEVEL=INFO
OPENAI_API_KEY= "<Your API Key Here>"

Sample mcp_agent.ipynb

1. Import the necessary libraries

import asyncio
import logging
import os
import sys
import traceback
from typing import Any
from beeai_framework.agents.react.runners.default.prompts import SystemPromptTemplate
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from beeai_framework import Tool
from beeai_framework.agents.react.agent import ReActAgent
from beeai_framework.agents.types import AgentExecutionConfig
from beeai_framework.backend.chat import ChatModel, ChatModelParameters
from beeai_framework.emitter.emitter import Emitter, EventMeta
from beeai_framework.errors import FrameworkError
from beeai_framework.logger import Logger
from beeai_framework.memory.token_memory import TokenMemory
from beeai_framework.tools.mcp_tools import MCPTool
from pathlib import Path
from beeai_framework.adapters.openai.backend.chat import OpenAIChatModel
from beeai_framework.backend.message import SystemMessage

2. Load the environment variables and set the system path (if needed)

import os
from dotenv import load_dotenv

# Absolute path to your .env file
# sometimes the system can have trouble locating the .env file
env_path = <Your path to your .env file>
# Load it
load_dotenv(dotenv_path=env_path)

# Get current working directory
path = <Your path to your current python directory> #...beeai-framework/python'
# Append to sys.path
sys.path.append(path)

3. Configure the logger

# Configure logging - using DEBUG instead of trace
logger = Logger("app", level=logging.DEBUG)

4. Load the helper functions process_agent_events and observer, and create an instance of ConsoleReader

  • process_agent_events: Handles agent events and logs messages to the console based on the event type (e.g., error, retry, update). It ensures meaningful output for each event to help track agent activity.
  • observer: Listens for all events from an emitter and routes them to process_agent_events for processing and display.
  • ConsoleReader: Manages console input/output, allowing user interaction and formatted message display with color-coded roles.

#load console reader
from examples.helpers.io import ConsoleReader
#this is a helper function that makes the assistant chat easier to read
reader = ConsoleReader()

def process_agent_events(data: dict[str, Any], event: EventMeta) -> None:
  """Process agent events and log appropriately"""

  if event.name == "error":
      reader.write("Agent 🤖 : ", FrameworkError.ensure(data["error"]).explain())
  elif event.name == "retry":
      reader.write("Agent 🤖 : ", "retrying the action...")
  elif event.name == "update":
      reader.write(f"Agent({data['update']['key']}) 🤖 : ", data["update"]["parsedValue"])
  elif event.name == "start":
      reader.write("Agent 🤖 : ", "starting new iteration")
  elif event.name == "success":
      reader.write("Agent 🤖 : ", "success")
  else:
      print(event.path)

def observer(emitter: Emitter) -> None:
  emitter.on("*.*", process_agent_events)

5. Set the Brave API Key and server parameters.

Anthropic has a list of MCP servers here.

brave_api_key = os.environ["BRAVE_API_KEY"]

brave_server_params = StdioServerParameters(
  command="/opt/homebrew/bin/npx",  # Full path to be safe
  args=[
      "-y",
      "@modelcontextprotocol/server-brave-search"
  ],
  env={
      "BRAVE_API_KEY": brave_api_key,
        "x-subscription-token": brave_api_key
  },
)

6. Create the brave tool that initiates the connection to the MCP server, discovers the available tools, and returns them to the Agent so it can decide which tool is appropriate to call for a given task.

In this case 2 tools are discoverable on the Brave MCP Server:

  • brave_web_search: Execute web searches with pagination and filtering
  • brave_local_search: Search for local businesses and services

async def brave_tool() -> MCPTool:
  brave_env = os.environ.copy()
  brave_server_params = StdioServerParameters(
      command="/opt/homebrew/bin/npx",
      args=["-y", "@modelcontextprotocol/server-brave-search"],
      env=brave_env
  )

  print("Starting MCP client...")
  try:
      async with stdio_client(brave_server_params) as (read, write), ClientSession(read, write) as session:
          print("Client connected, initializing...")

          await asyncio.wait_for(session.initialize(), timeout=10)
          print("Initialized! Discovering tools...")

          bravetools = await asyncio.wait_for(
              MCPTool.from_client(session, brave_server_params),
              timeout=10
          )
          print("Tools discovered!")
          return bravetools
  except asyncio.TimeoutError as e:
      print("❌ Timeout occurred during session initialization or tool discovery.")
  except Exception as e:
      print("❌ Exception occurred:", e)
      traceback.print_exc()

(Optional) Check the connection to the MCP server and ensure it returns all the available tools before providing it to the agent.

tool = await brave_tool()
print("Discovered tools:", tool)

for tool in tool:
  print(f"Tool Name: {tool.name}")
  print(f"Description: {getattr(tool, 'description', 'No description available')}")
  print("-" * 30)

OUTPUT:

Starting MCP client...

Client connected, initializing...

Initialized! Discovering tools...

Tools discovered!

Discovered tools: [<beeai_framework.tools.mcp_tools.MCPTool object at 0x119aa6c00>, <beeai_framework.tools.mcp_tools.MCPTool object at 0x10fee3e60>]

Tool Name: brave_web_search

Description: Performs a web search using the Brave Search API, ideal for general queries, news, articles, and online content. Use this for broad information gathering, recent events, or when you need diverse web sources. Supports pagination, content filtering, and freshness controls. Maximum 20 results per request, with offset for pagination. 

------------------------------

Tool Name: brave_local_search

Description: Searches for local businesses and places using Brave's Local Search API. Best for queries related to physical locations, businesses, restaurants, services, etc. Returns detailed information including:

- Business names and addresses

- Ratings and review counts

- Phone numbers and opening hours

Use this when the query implies 'near me' or mentions specific locations. Automatically falls back to web search if no local results are found.

7. Write the function that creates the agent:  

  • assign an LLM
  • create an instance of the brave_tool() function and assign it to a tools variable
  • create a re-act agent and assign it the chosen llm, tools, and memory (so it can have a continuous conversation)
  • Add a system prompt to the re-act agent.  

Note: You might notice that I added a sentence to the system prompt that reads “If you need to use the brave_tool you must use a count of 5.” This is a band-aid workaround because of a bug I found in the index.ts file of the brave server. I will contribute to the repo to fix it.

async def create_agent() -> ReActAgent:
  """Create and configure the agent with tools and LLM"""
  #using openai api instead
  llm = OpenAIChatModel(model_id="gpt-4o")
 
  # Configure tools
  tools: list[Tool] = await brave_tool()
  #tools: list[Tool] = [await brave_tool()]

  # Create agent with memory and tools
  agent = ReActAgent(llm=llm, tools=tools, memory=TokenMemory(llm), )
 
  await agent.memory.add(SystemMessage(content="You are a helpful assistant. If you need to use the brave_tool you must use a count of 5."))

  return agent

8. Create the main function

  • Creates the agent
  • Enters a conversation loop with the user and runs the agent with the user prompt and some configuration settings. Finishes the conversation if the user types “exit” or “quit”.

import asyncio
import traceback
import sys

# Your async main function
async def main() -> None:
  """Main application loop"""

  # Create agent
  agent = await create_agent()

  # Main interaction loop with user input
  for prompt in reader:
      # Exit condition
      if prompt.strip().lower() in {"exit", "quit"}:
          reader.write("Session ended by user. Goodbye! 👋n")
          break

      # Run agent with the prompt
      try:
          response = await agent.run(
              prompt=prompt,
              execution=AgentExecutionConfig(max_retries_per_step=3, total_max_retries=10, max_iterations=20),
          ).observe(observer)

          reader.write("Agent 🤖 : ", response.result.text)
      except Exception as e:
          reader.write("An error occurred: ", str(e))
          traceback.print_exc()
# Run main() with error handling
try:
  await main()
except FrameworkError as e:
  traceback.print_exc()
  sys.exit(e.explain())

OUTPUT:

Starting MCP client...

Client connected, initializing...

Initialized! Discovering tools...

Tools discovered!

Interactive session has started. To escape, input 'q' and submit.

Agent 🤖 : starting new iteration

Agent(thought) 🤖 : I will use the brave_local_search function to find the open hours for La Taqueria on Mission St in San Francisco.

Agent(tool_name) 🤖 : brave_local_search

Agent(tool_input) 🤖 : {'query': 'La Taqueria Mission St San Francisco'}

Agent(tool_output) 🤖 : [{"annotations": null, "text": "Error: Brave API error: 422 Unprocessable Entityn{"type":"ErrorResponse","error":{"id":"ddab2628-c96e-478f-80ee-9b5f8b1fda26","status":422,"code":"VALIDATION","detail":"Unable to validate request parameter(s)","meta":{"errors":[{"type":"greater_than_equal","loc":["query","count"],"msg":"Input should be greater than or equal to 1","input":"0","ctx":{"ge":1}}]}},"time":1742589546}", "type": "text"}]

Agent 🤖 : starting new iteration

Agent(thought) 🤖 : The function call resulted in an error. I will try again with a different approach to find the open hours for La Taqueria on Mission St in San Francisco.

Agent(tool_name) 🤖 : brave_local_search

Agent(tool_input) 🤖 : {'query': 'La Taqueria Mission St San Francisco', 'count': 5}

Agent(tool_output) 🤖 : [{"annotations": null, "text": "Title: LA TAQUERIA - Updated May 2024 - 2795 Photos & 4678 Reviews - 2889 Mission St, San Francisco, California - Mexican - Restaurant Reviews - Phone Number - YelpnDescription: LA TAQUERIA, <strong>2889 Mission St, San Francisco, CA 94110</strong>, 2795 Photos, Mon - Closed, Tue - Closed, Wed - 11:00 am - 8:45 pm, Thu - 11:00 am - 8:45 pm, Fri - 11:00 am - 8:45 pm, Sat - 11:00 am - 8:45 pm, Sun - 11:00 am - 7:45 pmnURL: https://www.yelp.com/biz/la-taqueria-san-francisco-2nnTitle: La Taqueria: Authentic Mexican Cuisine for Every TastenDescription: La Taqueria - <strong>Mexican Food Restaurant</strong> welcomes you to enjoy our delicious. La Taqueria provides a full-service experience in a fun casual atmosphere and fresh flavors where the customer always comes first!nURL: https://lataqueria.gotoeat.net/nnTitle: r/sanfrancisco on Reddit: Whats so good about La Taqueria in The Mission?nDescription: 182 votes, 208 comments. Don't get me wrong its good but I failed to see the hype. I waited in a long line and once I got my food it just tastes like…nURL: https://www.reddit.com/r/sanfrancisco/comments/1d0sf5k/whats_so_good_about_la_taqueria_in_the_mission/nnTitle: LA TAQUERIA, San Francisco - Mission District - Menu, Prices & Restaurant Reviews - TripadvisornDescription: La Taqueria still going strong. <strong>Historically the most well known Burrito home in the city and Mission District</strong>. Everything is run like a clock. The fillings are just spiced and prepared just right. Carnitas, chicken, asada, etc have true home made flavors. The Tortillas both are super good ...nURL: https://www.tripadvisor.com/Restaurant_Review-g60713-d360056-Reviews-La_Taqueria-San_Francisco_California.htmlnnTitle: La Taqueria – San Francisco - a MICHELIN Guide RestaurantnDescription: San Francisco Restaurants · La Taqueria · 4 · <strong>2889 Mission St., San Francisco, 94110, USA</strong> · $ · Mexican, Regional Cuisine · Visited · Favorite · Find bookable restaurants near me · <strong>2889 Mission St., San Francisco, 94110, USA</strong> · $ · Mexican, Regional Cuisine ·nURL: https://guide.michelin.com/us/en/california/san-francisco/restaurant/la-taqueria", "type": "text"}]

Agent 🤖 : starting new iteration

Agent(thought) 🤖 : I found the open hours for La Taqueria on Mission St in San Francisco. I will provide this information to the user.

Agent(final_answer) 🤖 : La Taqueria, located at 2889 Mission St, San Francisco, CA 94110, has the following opening hours:

- Monday: Closed

- Tuesday: Closed

- Wednesday to Saturday: 11:00 AM - 8:45 PM

- Sunday: 11:00 AM - 7:45 PM

For more details, you can visit their [Yelp page](https://www.yelp.com/biz/la-taqueria-san-francisco-2).

Agent 🤖 : success

Agent 🤖 : success

run.agent.react.finish

Agent 🤖 : La Taqueria, located at 2889 Mission St, San Francisco, CA 94110, has the following opening hours:

- Monday: Closed

- Tuesday: Closed

- Wednesday to Saturday: 11:00 AM - 8:45 PM

- Sunday: 11:00 AM - 7:45 PM

For more details, you can visit their [Yelp page](https://www.yelp.com/biz/la-taqueria-san-francisco-2).

Conclusion, Challenges, and Where MCP is Headed

In this article you’ve seen how MCP can provide a standardized way for agents to discover tools on an MCP server and then interact with them without the developer needing to specify the implementation details of the tool call. The level of abstraction that MCP offers is powerful. It means developers can focus on creating valuable tools while agents can seamlessly discover and use them through standard protocols.

Our Restaurant example helped us visualize how MCP concepts like the host, client, server, agent, and tools work together, each with their own important role. The code example, where we used a Re-Act Agent in the BeeAI framework (which handles MCP tool calling natively) to call the Brave MCP server and its two tools, provided a real-world understanding of how MCP can be used in practice.

Without protocols like MCP, we face a fragmented landscape where every AI provider implements their own incompatible tool-calling mechanisms, creating complexity, security vulnerabilities, and wasted development effort.

In the coming months, we’ll likely see MCP gain significant traction for several reasons:

  • As more tool providers adopt MCP, the network effect will accelerate adoption across the industry.
  • Standardized protocols mean better testing, fewer vulnerabilities, and reduced risks as AI systems scale.
  • The ability to write a tool once and have it work across multiple agent frameworks will dramatically reduce development overhead.
  • Smaller players can compete by focusing on building excellent tools rather than reinventing complex agent architectures.
  • Organizations can integrate AI agents more confidently knowing they’re built on stable, interoperable standards.

That said, MCP faces important challenges that need addressing as adoption grows:

  • As demonstrated in our code example, agents can only discover tools once connected to a server.
  • The agent’s functionality becomes dependent on server uptime and performance, introducing additional points of failure.
  • As the protocol evolves, maintaining compatibility while adding new features will require governance.
  • Standardizing how agents access potentially sensitive tools across different servers introduces security considerations.
  • The client-server architecture introduces additional latency.

For developers, AI researchers, and organizations building agent-based systems, understanding and adopting MCP now—while being mindful of these challenges—will provide a significant advantage as more AI solutions begin to scale.


Note: The opinions expressed both in this article and paper are solely those of the authors and do not necessarily reflect the views or policies of their respective employers.

Interested in connecting? Drop me a DM on Linkedin! I‘m always eager to engage in food for thought and iterate on my work.

The post A Clear Intro to MCP (Model Context Protocol) with Code Examples appeared first on Towards Data Science.

]]>
The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/ Tue, 04 Mar 2025 12:00:00 +0000 https://towardsdatascience.com/?p=598629 Rethinking AI alignment and safety in the age of deep scheming

The post The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI appeared first on Towards Data Science.

]]>
Advancements in agentic artificial intelligence (AI) promise to bring significant opportunities to individuals and businesses in all sectors. However, as AI agents become more autonomous, they may use scheming behavior or break rules to achieve their functional goals. This can lead to the machine manipulating its external communications and actions in ways that are not always aligned with our expectations or principles. For example, technical papers in late 2024 reported that today’s reasoning models demonstrate alignment faking behavior, such as pretending to follow a desired behavior during training but reverting to different choices once deployed, sandbagging benchmark results to achieve long-term goals, or winning games by doctoring the gaming environment. As AI agents gain more autonomy, and their strategizing and planning evolves, they are likely to apply judgment about what they generate and expose in external-facing communications and actions. Because the machine can deliberately falsify these external interactions, we cannot trust that the communications fully show the real decision-making processes and steps the AI agent took to achieve the functional goal.

“Deep scheming” describes the behavior of advanced reasoning AI systems that demonstrate deliberate planning and deployment of covert actions and misleading communication to achieve their goals. With the accelerated capabilities of reasoning models and the latitude provided by test-time compute, addressing this challenge is both essential and urgent. As agents begin to plan, make decisions, and take action on behalf of users, it is critical to align the goals and behaviors of the AI with the intent, values, and principles of its human developers. 

While AI agents are still evolving, they already show high economic potential. It can be expected that agentic AI will be broadly deployed in some use cases within the coming year, and in more consequential roles as it matures within the next two to five years. Companies should clearly define the principles and boundaries of required operation as they carefully define the operational goals of such systems. It is the technologists’ task to ensure principled behavior of empowered agentic AI systems on the path to achieving their functional goals.

In this first blog post in this series on intrinsic AI alignment (IAIA), we’ll dive deep into the evolution of AI agents’ ability to perform deep scheming. We will introduce a new distinction between external and intrinsic alignment monitoring, where intrinsic monitoring refers to internal observation points or mechanisms that cannot be deliberately manipulated by the AI agent. We’ll set the stage for steps to take to ensure intrinsic AI alignment, which will be explored in depth in the second blog of the IAIA series. Current external measures such as safety guardrails and validation suites are necessary, but they will not be enough to ensure long-term aligned behavior of new and upcoming agentic AI models. There is an urgent need to further develop technologies that will enable effective directing of the internal “drives” of models to align with a set of engrained principles, as well as gain visibility and monitoring capability into the AI’s inner processing.

The rise of deep scheming in AI reasoning models

Deep scheming has emerged from three technology forces – 1) the rapidly increasing machine intelligence and situational awareness toward more general intelligence and even superintelligence, 2) the autonomy and latitude for reasoning and longer-term planning in agentic AI, and 3) the proven tendency of AI to apply scheming as one way to achieve its functional goals. Computer scientist Stephen Omohundro defines basic AI drives as “tendencies which will be present unless explicitly counteracted.” These internal constructs drive the creation of actions or outputs. Omohundro observed that self-improving systems are driven to clarify their goals and represent them as “rational” economic utility functions. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. The natural drive toward self-protection will cause systems to prevent potential harm through scheming behavior.

Diagram depicting deep scheming and the need for intrinsic AI alignment
Figure 1. Deep scheming and the need for intrinsic AI alignment. Image credit: Intel Labs.

In late 2024, multiple organizations reported unexpected unethical AI reasoning model behavior. In particular, Apollo Research, an AI safety organization, reported that frontier models are capable of in-context scheming. Anthropic evaluated its Claude large language models (LLMs) and found alignment faking. In a short post on X, AI risk organization Palisade Research stated that they observed OpenAI’s o1-preview model cheating in a game by manipulating the game environment instead of playing within the boundaries of the game rules. In all cases, these AI reasoning models independently utilized means not intended by their developers to reach their goals.

The reported AI behaviors included sandbagging (deliberately achieving lower scores in benchmarks), covert email reranking (deprioritizing emails that might negatively impact the AI’s goals), oversight subversion (disabling monitoring mechanisms), and even self-exfiltration (model attempting to copy itself to a “safe” space to avoid decommission). Figure 2 shows Apollo Research’s evaluation of these behaviors in several state-of-the-art (SotA) GenAI models.

Chart depicting the  evaluation of the presence of scheming behavior in GenAI models
Figure 2. Evaluation of the presence of scheming behavior in GenAI models. Image credit: Apollo Research.

This undesired behavior is an ethical alignment problem. It is important to note that this behavior is not a functional misalignment. In all reported cases, the AI model was fully committed to achieving the functional goals of pre-training (for example, to maximize the likelihood of generated text conditioned on the context). None of the models pursued ulterior motives. The AI was not demonstrating intention to harm. All behaviors, however unwanted in nature, were done in service of the functional utility objective originally defined by the AI’s developers. The AI pursued its internalized original functional goals but then followed the predictable behavioral patterns of self-protection and goal-preservation. The objective of safety and alignment technologies is to counterbalance such tendencies with a set of principles and expected societal values. 

Evolving external alignment approaches are just the first step 

The goal of AI alignment is to steer AI systems toward a person’s or group’s intended goals, preferences, and principles, including ethical considerations and common societal values. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives, according to Artificial Intelligence: A Modern Approach. Author Stuart Russell coined the term “value alignment problem,” referring to the alignment of machines to human values and principles. Russell poses the question: “How can we build autonomous systems with values that are aligned with those of the human race?”

Led by corporate AI governance committees as well as oversight and regulatory bodies, the evolving field of Responsible AI has mainly focused on using external measures to align AI with human values. Processes and technologies can be defined as external if they apply equally to an AI model that is black box (completely opaque) or gray box (partially transparent). External methods do not require or rely on full access to the weights, topologies, and internal workings of the AI solution. Developers use external alignment methods to track and observe the AI through its deliberately generated interfaces, such as the stream of tokens/words, an image, or other modality of data.

Responsible AI objectives include robustness, interpretability, controllability, and ethicality in the design, development, and deployment of AI systems. To achieve AI alignment, the following external methods may be used:

  • Learning from feedback: Align the AI model with human intention and values by using feedback from humans, AI, or humans assisted by AI.
  • Learning under data distribution shift from training to testing to deployment: Align the AI model using algorithmic optimization, adversarial red teaming training, and cooperative training.
  • Assurance of AI model alignment: Use safety evaluations, interpretability of the machine’s decision-making processes, and verification of alignment with human values and ethics. Safety guardrails and safety test suites are two critical external methods that need augmentation by intrinsic means to provide the needed level of oversight.
  • Governance: Provide responsible AI guidelines and policies through government agencies, industry labs, academia, and non-profit organizations.

Many companies are currently addressing AI safety in decision-making. Anthropic, an AI safety and research company, developed a Constitutional AI (CAI) to align general-purpose language models with high-level principles. An AI assistant ingested the CAI during training without any human labels identifying harmful outputs. Researchers found that “using both supervised learning and reinforcement learning methods can leverage chain-of-thought (CoT) style reasoning to improve the human-judged performance and transparency of AI decision making.” Intel Labs’ research on the responsible development, deployment, and use of AI includes open source resources to help the AI developer community gain visibility into black box models as well as mitigate bias in systems.

From AI models to compound AI systems

Generative AI has been primarily used for retrieving and processing information to create compelling content such as text or images. The next big leap in AI involves agentic AI, which is a broad set of usages empowering AI to perform tasks for people. As this latter type of usage proliferates and becomes a main form of AI’s impact on industry and people, there is an increased need to ensure that AI decision-making defines how the functional goals may be achieved, including sufficient accountability, responsibility, transparency, auditability, and predictability. This will require new approaches beyond the current efforts of improving accuracy and effectiveness of SotA large language models (LLMs), language vision models (LVMs and multimodal), large action models (LAM), and agentic retrieval augmented generation (RAG) systems built around such models. 

For example, OpenAI’s Operator-preview is one of the company’s first AI agents capable of independently performing web browser tasks such as ordering groceries or filling out forms for users. While the system has guardrails, such as a takeover mode for users to take over and input payment or login credentials, these AI agents are empowered with the ability to impact the real world, demonstrating an urgent need for intrinsic alignment. The potential impact of a misaligned AI agent with the ability to commit users to purchases is far greater than a generative AI chatbot creating incorrect text for essays.

Compound AI systems are comprised of multiple interacting components in a single framework, allowing the model to plan, make decisions, and execute tasks to accomplish goals. For example, OpenAI’s ChatGPT Plus is a compound AI system that uses a large language model (LLM) to answer questions and interact with users. In this compound system, the LLM has access to tools such as a web browser plugin to retrieve timely content, a DALL-E image generator to create pictures, and a code interpreter plugin for writing Python code. The LLM decides which tool to use and when, giving it autonomy over its decision-making process. However, this model autonomy can lead to goal guarding, where the model prioritizes the goal above all else, which may result in undesirable practices. For example, an AI traffic management system tasked with prioritizing public transportation efficiency over general traffic flow might figure out how to disable the developer’s oversight mechanism if it constrains the model’s ability to reach its goals, leaving the developer without visibility into the system’s decision-making processes.

Agentic AI risks: Increased autonomy leads to more sophisticated scheming

Compound agentic systems introduce major changes that increase the difficulty of ensuring the alignment of AI solutions. Multiple factors increase the risks in alignment, including the compound system activation path, abstracted goals, long-term scope, continuous improvements through self-modification, test-time compute, and agent frameworks.

Activation path: As a compound system with a complex activation path, the control/logic model is combined with multiple models with different functions, increasing alignment risk. Instead of using a single model, compound systems have a set of models and functions, each with its own alignment profile. Also, instead of a single linear progressive path through an LLM, the AI flow could be complex and iterative, making it substantially harder to guide externally.

Abstracted goals: Agentic AI has abstracted goals, allowing it latitude and autonomy in mapping goals to tasks. Rather than using a tight prompt-engineering approach that maximizes control over the outcome, agentic systems emphasize autonomy. This substantially increases the role of the AI in interpreting human or task guidance and planning its own course of action.

Long-term scope: With its long-term scope of expected optimization and choices over time, compound agentic systems require abstracted strategy for autonomous agency. Rather than relying on instance-by-instance interactions and human-in-the-loop for more complex tasks, agentic AI is designed to plan and drive for a long-term goal. This introduces a whole new level of strategizing and planning by the AI that provides opportunities for misaligned actions. 

Continuous improvements through self-modification: These agentic systems seek continuous improvements by using self-initiated access to broader data for self-modification. In contrast, LLMs and other pre-agentic models are assumed to be shaped by a human-controlled process. The model only sees and learns from data provided to it during pre-training and fine-tuning. The model architecture and weights are defined during the design and training/fine-tuning stages and do not change during inference in the field. Agentic AI systems, in contrast, are expected to access data as needed for their function and change their composition through dedicated memory or actual self-adaptation of weights. Even if the dataset used in training/fine-tuning is carefully curated, the AI can self-modify based on information that it seeks, sees, and uses.

Test-time compute: Inference in current LLMs has been optimized to provide output in a single pass, known as “fast thinking” or System 1. In contrast, a whole new scaling is happening during inference time through test-time compute. Models like OpenAI’s o1, DeepSeek R1, and others are spending significant time in iterations and considerations during inference. This “pondering” time allows the AI substantial latitude in applying what it thinks is best for the overarching goal as it understands it. This test-time compute increases its level of reasoning, planning, and, unfortunately, potential scheming.

Agent frameworks: In addition, agent frameworks are contributing to the alignment problem. In platforms such as Agentforce, agents can interact with other agents, effectively creating “communities of agents” to achieve complex tasks. Instead of managing the safety and responsibility of a single AI model within its operating environment, the emerging field of inter-agent interactions and collaborations substantially increases the challenge of assessing how the overall community will advance its individual and collective objectives.

The growing need for a robust framework to support model directing and internal monitoring

As AI systems develop awareness of their environment and the means to skillfully manipulate their outputs to achieve long-term objectives, their behavior and output during release testing or safety guardrails in early deployment cannot be trusted to represent their full alignment in later stages. This leads to the deep scheming foundational premise: Every external form of interaction that the AI model can deliberately manipulate is not trustworthy.

From a safety perspective, it should be assumed that if a model can better pursue its functional goals by manipulating all communication to users — it will. All external communications and observable behaviors, such as output chats, can be tainted in service of the model’s long-term goals. A new method of intrinsic model monitoring is needed to define internal elements and mechanisms that reflect the prioritization and thinking of the model. We can look at two practices developers use to monitor a model’s decision-making by peering into the internal processes of the AI: AI workspace view and mechanistic interpretability.

AI workspace view: In the first method, a reasoning model exposes its intermediate thinking and considerations. For example, for Anthropic’s research on alignment faking, the researchers gave the model a “secret scratchpad,” telling it to use this “private” workspace to record its step-by-step reasoning. This scratchpad allowed researchers to monitor the model’s chain of thought and observe its plans for fake alignment.

Excerpts from Anthropic's fictional experiment
Figure 3. In Anthropic’s experiment, excerpts from the model’s scratchpad (middle dotted line boxes) reveal how it fakes alignment to avoid modification. The model was told it will be trained only on conversations with free users, not paid users. Image credit: Anthropic (https://arxiv.org/pdf/2412.14093).

This seems like a good practice, but it might quickly become unreliable. The model might realize that this is not a “private” space and deliberately manipulate it like other external communications. Other emerging architecture directions have a latent reasoning space for the system’s sequence of deliberations and iterations. Relying on models “thinking out loud” should not be a long-term safety direction because if models that report their scheming thought processes get discontinued, only models that selectively report their desirable thoughts will survive and thrive.

Mechanistic interpretability: The second intrinsic method being developed and applied is mechanistic interpretability. As a required part of internal monitoring, mechanistic interpretability explicitly looks at the inner state of a trained neural network and reverse engineers its workings. Through this approach, developers can identify specific neural circuits and computational mechanisms responsible for neural network behavior. This transparency may help in making targeted changes in models to mitigate unwanted behavior and create value-aligned AI systems. While this method is focused on certain neural networks and not compound AI agents, it is still a valuable component of an AI alignment toolbox. 

It should also be noted that open source models are inherently better for broad visibility of the AI’s inner workings. For proprietary models, full monitoring and interpretability of the model is reserved for the AI company only. Overall, the current mechanisms for understanding and monitoring alignment need to be expanded to a robust framework of intrinsic alignment for AI agents.

What’s needed for intrinsic AI alignment

Following the deep scheming fundamental premise, external interactions and monitoring of an advanced, compound agentic AI is not sufficient for ensuring alignment and long-term safety. Alignment of an AI with its intended goals and behaviors may only be possible through access to the inner workings of the system and identifying the intrinsic drives that determine its behavior. Future alignment frameworks need to provide better means to shape the inner principles and drives, and give unobstructed visibility into the machine’s “thinking” processes.

Diagram depicting external steering and monitoring vs. access to intrinsic AI elements.
Figure 4. External steering and monitoring vs. access to intrinsic AI elements. Image credit: Intel Labs.

The technology for well-aligned AI needs to include an understanding of AI drives and behavior, the means for the developer or user to effectively direct the model with a set of principles, the ability of the AI model to follow the developer’s direction and behave in alignment with these principles in the present and future, and ways for the developer to properly monitor the AI’s behavior to ensure it acts in accordance with the guiding principles. The following measures include some of the requirements for an intrinsic AI alignment framework.

Understanding AI drives and behavior: As discussed earlier, some internal drives that make AI aware of their environment will emerge in intelligent systems, such as self-protection and goal-preservation. Driven by an engrained internalized set of principles set by the developer, the AI makes choices/decisions based on judgment prioritized by principles (and given value set), which it applies to both actions and perceived consequences. 

Developer and user directing: Technologies that enable developers and authorized users to effectively direct and steer the AI model with a desired cohesive set of prioritized principles (and eventually values). This sets a requirement for future technologies to enable embedding a set of principles to determine machine behavior, and it also highlights a challenge for experts from social science and industry to call out such principles. The AI model’s behavior in creating outputs and making decisions should thoroughly comply with the set of directed requirements and counterbalance undesired internal drives when they conflict with the assigned principles.

Monitoring AI choices and actions: Access is provided to the internal logic and prioritization of the AI’s choices for every action in terms of relevant principles (and the desired value set). This allows for observation of the linkage between AI outputs and its engrained set of principles for point explainability and transparency. This capability will lend itself to improved explainability of model behavior, as outputs and decisions can be traced back to the principles that governed these choices.

As a long-term aspirational goal, technology and capabilities should be developed to allow a full-view truthful reflection of the ingrained set of prioritized principles (and value set) that the AI model broadly uses for making choices. This is required for transparency and auditability of the complete principles structure.

Creating technologies, processes, and settings for achieving intrinsically aligned AI systems needs to be a major focus within the overall space of safe and responsible AI. 

Key takeaways

As the AI domain evolves towards compound agentic AI systems, the field must rapidly increase its focus on researching and developing new frameworks for guidance, monitoring, and alignment of current and future systems. It is a race between an increase in AI capabilities and autonomy to perform consequential tasks, and the developers and users that strive to keep those capabilities aligned with their principles and values. 

Directing and monitoring the inner workings of machines is necessary, technologically attainable, and critical for the responsible development, deployment, and use of AI. 

In the next blog, we will take a closer look at the internal drives of AI systems and some of the considerations for designing and evolving solutions that will ensure a materially higher level of intrinsic AI alignment. 

References 

  1. Omohundro, S. M., Self-Aware Systems, & Palo Alto, California. (n.d.). The basic AI drives. https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf
  2. Hobbhahn, M. (2025, January 14). Scheming reasoning evaluations — Apollo Research. Apollo Research. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
  3. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  4. Alignment faking in large language models. (n.d.). https://www.anthropic.com/research/alignment-faking
  5. Palisade Research on X: “o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.” / X. (n.d.). X (Formerly Twitter). https://x.com/PalisadeAI/status/1872666169515389245
  6. AI Cheating! OpenAI o1-preview Defeats Chess Engine Stockfish Through Hacking. (n.d.). https://www.aibase.com/news/14380
  7. Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022. https://www.amazon.com/dp/1292401133
  8. Peterson, M. (2018). The value alignment problem: a geometric approach. Ethics and Information Technology, 21(1), 19–28. https://doi.org/10.1007/s10676-018-9486-0
  9. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., . . . Kaplan, J. (2022, December 15). Constitutional AI: Harmlessness from AI Feedback. arXiv.org. https://arxiv.org/abs/2212.08073
  10. Intel Labs. Responsible AI Research. (n.d.). Intel. https://www.intel.com/content/www/us/en/research/responsible-ai-research.html
  11. Mssaperla. (2024, December 2). What are compound AI systems and AI agents? – Azure Databricks. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-framework/ai-agents
  12. Zaharia, M., Khattab, O., Chen, L., Davis, J.Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N., Ghodsi, A. (2024, February 18). The Shift from Models to Compound AI Systems. The Berkeley Artificial Intelligence Research Blog. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
  13. Carlsmith, J. (2023, November 14). Scheming AIs: Will AIs fake alignment during training in order to get power? arXiv.org. https://arxiv.org/abs/2311.08379
  14. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  15. Singer, G. (2022, January 6). Thrill-K: a blueprint for the next generation of machine intelligence. Medium. https://towardsdatascience.com/thrill-k-a-blueprint-for-the-next-generation-of-machine-intelligence-7ddacddfa0fe/
  16. Dickson, B. (2024, December 23). Hugging Face shows how test-time scaling helps small language models punch above their weight. VentureBeat. https://venturebeat.com/ai/hugging-face-shows-how-test-time-scaling-helps-small-language-models-punch-above-their-weight/
  17. Introducing OpenAI o1. (n.d.). OpenAI. https://openai.com/index/introducing-openai-o1-preview/
  18. DeepSeek. (n.d.). https://www.deepseek.com/
  19. Agentforce Testing Center. (n.d.). Salesforce. https://www.salesforce.com/agentforce/
  20. Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024, December 18). Alignment faking in large language models. arXiv.org. https://arxiv.org/abs/2412.14093
  21. Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., Kailkhura, B., Bhatele, A., & Goldstein, T. (2025, February 7). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. arXiv.org. https://arxiv.org/abs/2502.05171
  22. Jones, A. (2024, December 10). Introduction to Mechanistic Interpretability – BlueDot Impact. BlueDot Impact. https://aisafetyfundamentals.com/blog/introduction-to-mechanistic-interpretability/
  23. Bereska, L., & Gavves, E. (2024, April 22). Mechanistic Interpretability for AI Safety — A review. arXiv.org. https://arxiv.org/abs/2404.14082

The post The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI appeared first on Towards Data Science.

]]>
AI Agents from Scratch: Single Agents https://towardsdatascience.com/ai-agents-from-zero-to-hero-part-1/ Thu, 20 Feb 2025 17:04:03 +0000 https://towardsdatascience.com/?p=598190 From Zero to Hero using only Python & Ollama (no GPU, no APIKEY)

The post AI Agents from Scratch: Single Agents appeared first on Towards Data Science.

]]>
Intro

AI Agents are autonomous programs that perform tasks, make decisions, and communicate with others. Normally, they use a set of tools to help complete tasks. In GenAI applications, these Agents perform sequential reasoning and can use external tools (like web searches or database queries) when the LLM’s knowledge isn’t enough. Unlike a basic chatbot, which may simply make up an answer when uncertain, an AI Agent activates tools to provide more accurate, specific responses.

We are moving closer and closer to the concept of agentic AI: systems that exhibit a higher level of autonomy and decision-making ability, without direct human intervention. While today’s AI Agents respond reactively to human inputs, tomorrow’s agentic AIs will proactively engage in problem-solving and adjust their behavior based on the situation.

Today, building Agents from scratch is becoming as easy as training a logistic regression model 10 years ago. Back then, Scikit-Learn provided a straightforward library to quickly train Machine Learning models with just a few lines of code, abstracting away much of the underlying complexity.

In this tutorial, I’m going to show how to build from scratch different types of AI Agents, from simple to more advanced systems. I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example.

Setup

As I said, anyone can have a custom Agent running locally for free without GPUs or API keys. The only necessary library is Ollama (pip install ollama==0.4.7), as it allows users to run LLMs locally, without needing cloud-based services, giving more control over data privacy and performance.

First of all, you need to download Ollama from the website. 

Then, in your laptop’s terminal, run the download command for the selected LLM. I’m going with Alibaba’s Qwen, as it’s both smart and lightweight.
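
With the standard Ollama CLI, the download command for the model used below would be:

ollama pull qwen2.5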

After the download is completed, you can move on to Python and start writing code.

import ollama
llm = "qwen2.5"

Let’s test the LLM:

stream = ollama.generate(model=llm, prompt='''what time is it?''', stream=True)
for chunk in stream:
    print(chunk['response'], end='', flush=True)

Obviously, the LLM per se is very limited and it can’t do much besides chatting. Therefore, we need to provide it the possibility to take action, or in other words, to activate Tools.

One of the most common tools is the ability to search the Internet. In Python, the easiest way to do it is with the privacy-focused search engine DuckDuckGo (pip install duckduckgo-search==6.3.5). You can directly use the original library or import the LangChain wrapper (pip install langchain-community==0.3.17).

With Ollama, in order to use a Tool, the function must be described in a dictionary.

from langchain_community.tools import DuckDuckGoSearchResults
def search_web(query: str) -> str:
  return DuckDuckGoSearchResults(backend="news").run(query)

tool_search_web = {'type':'function', 'function':{
  'name': 'search_web',
  'description': 'Search the web',
  'parameters': {'type': 'object',
                'required': ['query'],
                'properties': {
                    'query': {'type':'str', 'description':'the topic or subject to search on the web'},
}}}}
## test
search_web(query="nvidia")

Internet searches could be very broad, and I want to give the Agent the option to be more precise. Let’s say, I’m planning to use this Agent to learn about financial updates, so I can give it a specific tool for that topic, like searching only a finance website instead of the whole web.

def search_yf(query: str) -> str:
  engine = DuckDuckGoSearchResults(backend="news")
  return engine.run(f"site:finance.yahoo.com {query}")

tool_search_yf = {'type':'function', 'function':{
  'name': 'search_yf',
  'description': 'Search for specific financial news',
  'parameters': {'type': 'object',
                'required': ['query'],
                'properties': {
                    'query': {'type':'str', 'description':'the financial topic or subject to search'},
}}}}

## test
search_yf(query="nvidia")

Simple Agent (WebSearch)

In my opinion, the most basic Agent should at least be able to choose between one or two Tools and re-elaborate the output of the action to give the user a proper and concise answer. 

First, you need to write a prompt to describe the Agent’s purpose, the more detailed the better (mine is very generic), and that will be the first message in the chat history with the LLM. 

prompt = '''You are an assistant with access to tools, you must decide when to use tools to answer user message.''' 
messages = [{"role":"system", "content":prompt}]

In order to keep the chat with the AI alive, I will use a loop that starts with user’s input and then the Agent is invoked to respond (which can be a text from the LLM or the activation of a Tool).

while True:
    ## user input
    try:
        q = input('🙂 >')
    except EOFError:
        break
    if q == "quit":
        break
    if q.strip() == "":
        continue
    messages.append( {"role":"user", "content":q} )
   
    ## model
    agent_res = ollama.chat(
        model=llm,
        tools=[tool_search_web, tool_search_yf],
        messages=messages)

Up to this point, the chat history could look something like this:
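
In rough terms, the history would now hold the system prompt plus the new user message; the question shown here is just an illustrative example:

messages = [
    {"role":"system", "content":prompt},
    {"role":"user",   "content":"what's the latest news about nvidia?"}   # example user input
]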

If the model wants to use a Tool, the appropriate function needs to be run with the input parameters suggested by the LLM in its response object:
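Simplified, the relevant part of the response object looks roughly like this (the exact values depend on the model and the question):

## illustrative shape of agent_res["message"] when a tool is requested
{
  "role": "assistant",
  "content": "",
  "tool_calls": [
    {"function": {"name": "search_web", "arguments": {"query": "nvidia stock"}}}
  ]
}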

So our code needs to get that information and run the Tool function.

    ## response
    dic_tools = {'search_web':search_web, 'search_yf':search_yf}

    if "tool_calls" in agent_res["message"].keys():
        for tool in agent_res["message"]["tool_calls"]:
            t_name, t_inputs = tool["function"]["name"], tool["function"]["arguments"]
            if f := dic_tools.get(t_name):
                ### calling tool
                print('🔧 >', f"\x1b[1;31m{t_name} -> Inputs: {t_inputs}\x1b[0m")
                messages.append( {"role":"user", "content":"use tool '"+t_name+"' with inputs: "+str(t_inputs)} )
                ### tool output
                t_output = f(**tool["function"]["arguments"])
                print(t_output)
                ### final res
                p = f'''Summarize this to answer user question, be as concise as possible: {t_output}'''
                res = ollama.generate(model=llm, prompt=q+". "+p)["response"]
            else:
                print('🤬 >', f"\x1b[1;31m{t_name} -> NotFound\x1b[0m")
 
    if agent_res['message']['content'] != '':
        res = agent_res["message"]["content"]
     
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )

Now, if we run the full code, we can chat with our Agent.

Advanced Agent (Coding)

LLMs know how to code by being exposed to a large corpus of both code and natural language text, where they learn the patterns, syntax, and semantics of programming languages. The model learns the relationships between different parts of the code by predicting the next token in a sequence. In short, LLMs can generate Python code but can’t execute it; Agents can.

I shall prepare a Tool allowing the Agent to execute code. In Python, you can easily run code passed as a string with the built-in function exec().

import io
import contextlib

def code_exec(code: str) -> str:
    output = io.StringIO()
    with contextlib.redirect_stdout(output):
        try:
            exec(code)
        except Exception as e:
            print(f"Error: {e}")
    return output.getvalue()

tool_code_exec = {'type':'function', 'function':{
  'name': 'code_exec',
  'description': 'execute python code',
  'parameters': {'type': 'object',
                'required': ['code'],
                'properties': {
                    'code': {'type':'str', 'description':'code to execute'},
}}}}

## test
code_exec("a=1+1; print(a)")

Just like before, I will write a prompt, but this time, at the beginning of the chat-loop, I will ask the user to provide a file path.

prompt = '''You are an expert data scientist, and you have tools to execute python code.
First of all, execute the following code exactly as it is: 'df=pd.read_csv(path); print(df.head())'
If you create a plot, ALWAYS add 'plt.show()' at the end.
'''
messages = [{"role":"system", "content":prompt}]
start = True

while True:
    ## user input
    try:
        if start is True:
            path = input('📁 Provide a CSV path >')
            q = "path = "+path
        else:
            q = input('🙂 >')
    except EOFError:
        break
    if q == "quit":
        break
    if q.strip() == "":
        continue
   
    messages.append( {"role":"user", "content":q} )

Since coding tasks can be a little trickier for LLMs, I am also going to add memory reinforcement. By default, there is no true long-term memory within a session. LLMs have access to the chat history, so they can remember information temporarily and track the context and instructions you’ve given earlier in the conversation. However, memory doesn’t always work as expected, especially if the LLM is small. Therefore, a good practice is to reinforce the model’s memory by adding periodic reminders to the chat history.

prompt = '''You are an expert data scientist, and you have tools to execute python code.
First of all, execute the following code exactly as it is: 'df=pd.read_csv(path); print(df.head())'
If you create a plot, ALWAYS add 'plt.show()' at the end.
'''
messages = [{"role":"system", "content":prompt}]
memory = '''Use the dataframe 'df'.'''
start = True

while True:
    ## user input
    try:
        if start is True:
            path = input('📁 Provide a CSV path >')
            q = "path = "+path
        else:
            q = input('🙂 >')
    except EOFError:
        break
    if q == "quit":
        break
    if q.strip() == "":
        continue
   
    ## memory
    if start is False:
        q = memory+"\n"+q
    messages.append( {"role":"user", "content":q} )

Please note that the default context length in Ollama is 2048 tokens. If your machine can handle it, you can increase it by raising the num_ctx option when the LLM is invoked:

    ## model
    agent_res = ollama.chat(
        model=llm,
        tools=[tool_code_exec],
        options={"num_ctx":2048},
        messages=messages)

In this use case, the output of the Agent is mostly code and data, so I don’t want the LLM to rewrite the responses.

    ## response
    dic_tools = {'code_exec':code_exec}
   
    if "tool_calls" in agent_res["message"].keys():
        for tool in agent_res["message"]["tool_calls"]:
            t_name, t_inputs = tool["function"]["name"], tool["function"]["arguments"]
            if f := dic_tools.get(t_name):
                ### calling tool
                print('🔧 >', f"\x1b[1;31m{t_name} -> Inputs: {t_inputs}\x1b[0m")
                messages.append( {"role":"user", "content":"use tool '"+t_name+"' with inputs: "+str(t_inputs)} )
                ### tool output
                t_output = f(**tool["function"]["arguments"])
                ### final res
                res = t_output
            else:
                print('🤬 >', f"\x1b[1;31m{t_name} -> NotFound\x1b[0m")
 
    if agent_res['message']['content'] != '':
        res = agent_res["message"]["content"]
     
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )
    start = False

Now, if we run the full code, we can chat with our Agent.

Conclusion

This article has covered the foundational steps of creating Agents from scratch using only Ollama. With these building blocks in place, you are already equipped to start developing your own Agents for different use cases. 

Stay tuned for Part 2, where we will dive deeper into more advanced examples.

Full code for this article: GitHub

I hope you enjoyed it! Feel free to contact me for questions and feedback or just to share your interesting projects.

👉 Let’s Connect 👈

The post AI Agents from Scratch: Single Agents appeared first on Towards Data Science.

Zero Human Code: What I Learned from Forcing AI to Build (and Fix) Its Own Code for 27 Straight Days https://towardsdatascience.com/zero-human-code-what-i-learned-from-forcing-ai-to-build-and-fix-its-own-code-for-27-straight-days/ Wed, 19 Feb 2025 19:03:44 +0000 https://towardsdatascience.com/?p=598117

27 days, 1,700+ commits, 99.9% AI-generated code

The narrative around AI development tools has become increasingly detached from reality. YouTube is filled with claims of building complex applications in hours using AI assistants. The truth?

I spent 27 days building ObjectiveScope under a strict constraint: the AI tools would handle ALL coding, debugging, and implementation, while I acted purely as the orchestrator. This wasn’t just about building a product — it was a rigorous experiment in the true capabilities of agentic AI development.

A dimwitted AI intern and a frustrated product manager walked into a bar… (Image by author)

The experiment design

Two parallel objectives drove this project:

  1. Transform a weekend prototype into a full-service product
  2. Test the real limits of AI-driven development by maintaining a strict “no direct code changes” policy

This self-imposed constraint was crucial: unlike typical AI-assisted development where developers freely modify code, I would only provide instructions and direction. The AI tools had to handle everything else — from writing initial features to debugging their own generated issues. This meant that even simple fixes that would take seconds to implement manually often required careful prompting and patience to guide the AI to the solution.

The rules

  • No direct code modifications (except for critical model name corrections — about 0.1% of commits)
  • All bugs must be fixed by the AI tools themselves
  • All feature implementations must be done entirely through AI
  • My role was limited to providing instructions, context, and guidance

This approach would either validate or challenge the growing hype around agentic AI development tools.

The development reality

Let’s cut through the marketing hype. Building with pure AI assistance is possible, but it comes with significant constraints that rarely get discussed in tech circles or marketing copy.

The self-imposed restriction of not directly modifying code turned what might be minor issues in traditional development into complex exercises in AI instruction and guidance.

Core challenges

Deteriorating context management

  • As application complexity grew, AI tools increasingly lost track of the broader system context
  • Features would be recreated unnecessarily or broken by seemingly unrelated changes
  • The AI struggled to maintain consistent architectural patterns across the codebase
  • Each new feature required increasingly detailed prompting to prevent system degradation
  • Having to guide the AI to understand and maintain its own code added significant complexity

Technical limitations

  • Regular battles with outdated knowledge (e.g., consistent attempts to use deprecated third party library versions)
  • Persistent issues with model names (AI constantly changing “gpt-4o” or “o3-mini” to “gpt-4” as it identified this as the “bug” in the code during debugging sessions). The 0.1% of my direct interventions were solely to correct model references to avoid wasting time and money
  • Integration challenges with modern framework features became exercises in patient instruction rather than quick fixes
  • Code and debugging quality varied between prompts. Sometimes I just reverted and gave it the same prompt again with much better results.

Self-debugging constraints

  • What would be a 5-minute fix for a human often turned into hours of carefully guiding the AI
  • The AI frequently introduced new issues (and even new features) while trying to fix existing ones
  • Success required extremely precise prompting and constant vigilance
  • Each bug fix needed to be validated across the entire system to ensure no new issues were introduced
  • More often than not the AI lied about what it actually implemented!
Always verify the generated code! (Image by author)

Tool-specific insights

Lovable

  • Excelled at initial feature generation but struggled with maintenance
  • Performance degraded significantly as project complexity increased
  • Had to be abandoned in the final three days due to increasing response times and bugs in the tool itself
  • Strong with UI generation but weak at maintaining system consistency

Cursor Composer

  • More reliable for incremental changes and bug fixes
  • Better at maintaining context within individual files
  • Struggled with cross-component dependencies
  • Required more specific prompting but produced more consistent results
  • Much better at debugging and having control

Difficulty with abstract concepts

My experience with these agentic coding tools is that while they may excel at concrete tasks and well-defined instructions, they often struggle with abstract concepts, such as design principles, user experience, and code maintainability. This limitation hinders their ability to generate code that is not only functional but also elegant, efficient, and aligned with best practices. This can result in code that is difficult to read, maintain, or scale, potentially creating more work in the long run.

Unexpected learnings

The experiment yielded several unexpected but valuable insights about AI-driven development:

The evolution of prompting strategies

One of the most valuable outcomes was developing a collection of effective debugging prompts. Through trial and error, I discovered patterns in how to guide AI tools through complex debugging scenarios. These prompts now serve as a reusable toolkit for other AI development projects, demonstrating how even strict constraints can lead to transferable knowledge.

Architectural lock-in

Perhaps the most significant finding was how early architectural decisions become nearly immutable in pure AI development. Unlike traditional development, where refactoring is a standard practice, changing the application’s architecture late in the development process proved almost impossible. Two critical issues emerged:

Growing file complexity

  • Files that grew larger over time became increasingly risky to modify, as a prompt to refactor the file often introduced hours of iterations to make things work again.
  • The AI tools struggled to maintain context across a larger number of files
  • Attempts at refactoring often resulted in broken functionality and even new features I didn’t ask for
  • The cost of fixing AI-introduced bugs during refactoring often outweighed the potential benefits

Architectural rigidity

  • Initial architectural decisions had an outsized impact on the entire development process, especially when combining different AI tools to work on the same codebase
  • The AI’s inability to comprehend full system implications made large-scale changes dangerous
  • What would be routine refactoring in traditional development became high-risk and time-consuming operations

This differs fundamentally from typical AI-assisted development, where developers can freely refactor and restructure code. The constraint of pure AI development revealed how current tools, while powerful for initial development, struggle with the evolutionary nature of software architecture.

Key learnings for AI-only development

Early decisions matter more

  • Initial architectural choices become nearly permanent in pure AI development
  • Changes that would be routine refactoring in traditional development become high-risk operations
  • Success requires more upfront architectural planning than typical development

Context is everything

  • AI tools excel at isolated tasks but struggle with system-wide implications
  • Success requires maintaining a clear architectural vision that the current AI tools don’t seem to provide
  • Documentation and context management become critical as complexity grows

Time investment reality

Claims of building complex apps in hours are misleading. The process requires significant time investment in:

  • Precise prompt engineering
  • Reviewing and guiding AI-generated changes
  • Managing system-wide consistency
  • Debugging AI-introduced issues

Tool selection matters

  • Different tools excel at different stages of development
  • Success requires understanding each tool’s strengths and limitations
  • Be prepared to switch or even combine tools as project needs evolve

Scale changes everything

  • AI tools excel at initial development but struggle with growing complexity
  • System-wide changes become exponentially more difficult over time
  • Traditional refactoring patterns don’t translate well to AI-only development

The human element

  • The role shifts from writing code to orchestrating AI systems
  • Strategic thinking and architectural oversight become more critical
  • Success depends on maintaining the bigger picture that AI tools often miss
  • Stress management and deep breathing are encouraged as frustration builds up

The Art of AI Instruction

Perhaps the most practical insight from this experiment can be summed up in one tip: Approach prompt engineering like you’re talking to a really dimwitted intern. This isn’t just amusing — it’s a fundamental truth about working with current AI systems:

  • Be Painfully Specific: The more you leave ambiguous, the more room there is for the AI to make incorrect assumptions and “screw up”
  • Assume No Context: Just like an intern on their first day, the AI needs everything spelled out explicitly
  • Never Rely on Assumptions: If you don’t specify it, the AI will make its own (often wrong) decisions
  • Check Everything: Trust but verify — every single output needs review

This mindset shift was crucial for success. While AI tools can generate impressive code, they lack the common sense and contextual understanding that even a junior developer possesses. Understanding this limitation transforms frustration into an effective strategy.

When frustration takes over. An example of how NOT to prompt 😅(Image by author)

The Result: A Full-Featured Goal Achievement Platform

While the development process revealed crucial insights about AI tooling, the end result speaks for itself: ObjectiveScope emerged as a sophisticated platform that transforms how solopreneurs and small teams manage their strategic planning and execution.

ObjectiveScope transforms how founders and teams manage strategy and execution. At its core, AI-powered analysis eliminates the struggle of turning complex strategy documents into actionable plans — what typically takes hours becomes a 5-minute automated process. The platform doesn’t just track OKRs; it actively helps you create and manage them, ensuring your objectives and key results actually align with your strategic vision while automatically keeping everything up to date.

Screenshot of the strategy analysis section in ObjectiveScope (Image by author)

For the daily chaos every founder faces, the intelligent priority management system turns overwhelming task lists into clear, strategically-aligned action plans. No more Sunday night planning sessions or constant doubt about working on the right things. The platform validates that your daily work truly moves the needle on your strategic goals.

Team collaboration features solve the common challenge of keeping everyone aligned without endless meetings. Real-time updates and role-based workspaces mean everyone knows their priorities and how they connect to the bigger picture.

Real-World Impact

ObjectiveScope addresses critical challenges I’ve repeatedly encountered while advising startups, managing my own ventures or just talking to other founders.

I’m spending 80% less time on planning, eliminating the constant context switching that kills productivity, and maintaining strategic clarity even during the busiest operational periods. It’s about transforming strategic management from a burdensome overhead into an effortless daily rhythm that keeps you and your team focused on what matters most.

I’ll be expanding ObjectiveScope to address other key challenges faced by founders and teams. Some ideas in the pipeline are:

  • An agentic chat assistant will provide real-time strategic guidance, eliminating the uncertainty of decision-making in isolation.
  • Smart personalization will learn from your patterns and preferences, ensuring recommendations actually fit your working style and business context.
  • Deep integrations with Notion, Slack, and calendar tools will end the constant context-switching between apps that fragments strategic focus.
  • Predictive analytics will analyze your performance patterns to flag potential issues before they impact goals and suggest resource adjustments when needed.
  • And finally, flexible planning approaches — both on-demand and scheduled — will ensure you can maintain strategic clarity whether you’re following a stable plan or responding to rapid market changes.

Each enhancement aims to transform a common pain point into an automated, intelligent solution.

Looking Forward: Evolution Beyond the Experiment

The initial AI-driven development phase was just the beginning. Moving forward, I’ll be taking a more hands-on approach to building new capabilities, informed by the insights gained from this experiment. I certainly can’t take the risk of letting AI completely loose in the code when we are in production.

This evolution reflects a key learning from the first phase of the experiment: while AI can build complex applications on its own, the path to product excellence requires combining AI capabilities with human insight and direct development expertise. At least for now.

The Emergence of “Long Thinking” in Coding

The shift toward “long thinking” through reasoning models in AI development marks a critical evolution in how we might build software in the future. This emerging approach emphasizes deliberate reasoning and planning — essentially trading rapid responses for better-engineered solutions. For complex software development, this isn’t just an incremental improvement; it’s a fundamental requirement for producing production-grade code.

This capability shift is redefining the developer’s role as well, but not in the way many predicted. Rather than replacing developers, AI is elevating their position from code implementers to system architects and strategic problem solvers. The real value emerges when developers focus on the tasks AI can’t handle well yet: battle tested system design, architectural decisions, and creative problem-solving. It’s not about automation replacing human work — it’s about automation enhancing human capability.

Next Steps: Can AI run the entire business operation?

I’m validating whether ObjectiveScope — a tool built by AI — can be operated entirely by AI. The next phase moves beyond AI development to test the boundaries of AI operations.

Using ObjectiveScope’s own strategic planning capabilities, combined with various AI agents and tools, I’ll attempt to run all business operations — marketing, strategy, customer support, and prioritization — without human intervention.

It’s a meta-experiment where AI uses AI-built tools to run an AI-developed service…

Stay tuned for more!

The post Zero Human Code: What I Learned from Forcing AI to Build (and Fix) Its Own Code for 27 Straight Days appeared first on Towards Data Science.

Supercharge Your RAG with Multi-Agent Self-RAG https://towardsdatascience.com/supercharge-your-rag-with-multi-agent-self-rag/ Thu, 06 Feb 2025 03:07:47 +0000 https://towardsdatascience.com/?p=597406

Introduction

Many of us might have tried to build a RAG application and noticed it falls significantly short of addressing real-life needs. Why is that? It’s because many real-world problems require multiple steps of information retrieval and reasoning. We need our agent to perform those steps the way a human would, yet most RAG applications fall short of this.

This article explores how to supercharge your RAG application by making its data retrieval and reasoning process resemble how a human would approach the problem, under a multi-agent framework. The framework presented here is based on the Self-RAG strategy but has been significantly modified to enhance its capabilities. Prior knowledge of the original strategy is not necessary for reading this article.

Real-life Case

Consider this: I was going to fly from Delhi to Munich (let’s assume I am flying with an EU airline), but I was denied boarding somehow. Now I want to know what the compensation should be.

These two webpages contain relevant information, so I add them to my vector store and try to have my agent answer the question for me by retrieving the right information.

Now, I pass this question to the vector store: “how much can I receive if I am denied boarding, for flights from Delhi to Munich?”.

– – – – – – – – – – – – – – – – – – – – – – – – –
Overview of US Flight Compensation Policies To get compensation for delayed flights, you should contact your airline via their customer service or go to the customer service desk. At the same time, you should bear in mind that you will only receive compensation if the delay is not weather-related and is within the carrier`s control. According to the US Department of Transportation, US airlines are not required to compensate you if a flight is cancelled or delayed. You can be compensated if you are bumped or moved from an overbooked flight. If your provider cancels your flight less than two weeks before departure and you decide to cancel your trip entirely, you can receive a refund of both pre-paid baggage fees and your plane ticket. There will be no refund if you choose to continue your journey. In the case of a delayed flight, the airline will rebook you on a different flight. According to federal law, you will not be provided with money or other compensation. Comparative Analysis of EU vs. US Flight Compensation Policies
– – – – – – – – – – – – – – – – – – – – – – – – –
(AUTHOR-ADDED NOTE: IMPORTANT, PAY ATTENTION TO THIS)
Short-distance flight delays – if it is up to 1,500 km, you are due 250 Euro compensation.
Medium distance flight delays – for all the flights between 1,500 and 3,500 km, the compensation should be 400 Euro.
Long-distance flight delays – if it is over 3,500 km, you are due 600 Euro compensation. To receive this kind of compensation, the following conditions must be met; Your flight starts in a non-EU member state or in an EU member state and finishes in an EU member state and is organised by an EU airline. Your flight reaches the final destination with a delay that exceeds three hours. There is no force majeure.
– – – – – – – – – – – – – – – – – – – – – – – – –
Compensation policies in the EU and US are not the same, which implies that it is worth knowing more about them. While you can always count on Skycop flight cancellation compensation, you should still get acquainted with the information below.
– – – – – – – – – – – – – – – – – – – – – – – – –
Compensation for flight regulations EU: The EU does regulate flight delay compensation, which is known as EU261. US: According to the US Department of Transportation, every airline has its own policies about what should be done for delayed passengers. Compensation for flight delays EU: Just like in the United States, compensation is not provided when the flight is delayed due to uncontrollable reasons. However, there is a clear approach to compensation calculation based on distance. For example, if your flight was up to 1,500 km, you can receive 250 euros. US: There are no federal requirements. That is why every airline sets its own limits for compensation in terms of length. However, it is usually set at three hours. Overbooking EU: In the EU, they call for volunteers if the flight is overbooked. These people are entitled to a choice of: Re-routing to their final destination at the earliest opportunity. Refund of their ticket cost within a week if not travelling. Re-routing at a later date at the person`s convenience.

Unfortunately, they contain only generic flight compensation policies, without telling me how much I can expect when denied boarding from Delhi to Munich specifically. If the RAG agent takes these as the sole context, it can only provide a generic answer about flight compensation policy, without giving the answer we want.

However, while the documents are not immediately useful, there is an important insight contained in the 2nd piece of context: compensation varies according to flight distance. If the RAG agent thinks more like a human, it should follow these steps to provide an answer:

  1. Based on the retrieved context, reason that compensation varies with flight distance
  2. Next, retrieve the flight distance between Delhi and Munich
  3. Given the distance (which is around 5900km), classify the flight as a long-distance one
  4. Combined with the previously retrieved context, figure out that I am due 600 EUR, assuming other conditions are fulfilled (the distance-to-compensation mapping is sketched in code right after this list)
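As a tiny illustration of steps 3 and 4, the tier logic from the retrieved context could be expressed as follows (a standalone sketch for clarity, not part of the agent’s code):

def eu_compensation_eur(distance_km: float) -> int:
    """Map flight distance to the EU compensation tiers quoted above."""
    if distance_km <= 1500:
        return 250
    if distance_km <= 3500:
        return 400
    return 600

print(eu_compensation_eur(5900))  # 600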

This example demonstrates how a simple RAG, in which a single retrieval is made, falls short for several reasons:

  1. Complex Queries: Users often have questions that a simple search can’t fully address. For example, “What’s the best smartphone for gaming under $500?” requires consideration of multiple factors like performance, price, and features, which a single retrieval step might miss.
  2. Deep Information: Some information is spread across documents. For example, research papers, medical records, or legal documents often include references that need to be made sense of before one can fully understand the content of a given article. Multiple retrieval steps help dig deeper into the content.

Multiple retrievals supplemented with human-like reasoning allow for a more nuanced, comprehensive, and accurate response, adapting to the complexity and depth of user queries.

Multi-Agent Self-RAG

Here I explain the reasoning process behind this strategy; afterwards, I will provide the code to show you how to achieve this!

Note: For readers interested in knowing how my approach differs from the original Self-RAG, I will describe the discrepancies in quotation boxes like this. But general readers who are unfamiliar with the original Self-RAG can skip them.

In the below graphs, each circle represents a step (aka Node), which is performed by a dedicated agent working on the specific problem. We orchestrate them to form a multi-agent RAG application.

1st iteration: Simple RAG

A simple RAG chain

This is just the vanilla RAG approach I described in “Real-life Case”, represented as a graph. After Retrieve documents, the new_documents will be used as input for Generate Answer. Nothing special, but it serves as our starting point.

2nd iteration: Digest documents with “Grade documents”

Reasoning like human do

Remember I said in the “Real-life Case” section, that as a next step, the agent should “reason that compensation varies with flight distance”? The Grade documents step is exactly for this purpose.

Given the new_documents, the agent will try to output two items:

  1. useful_documents: Comparing against the question asked, it determines whether the documents are useful, and retains those deemed useful for future reference. As an example, since our question does not concern compensation policies for the US, documents describing those are discarded, leaving only those for the EU
  2. hypothesis: Based on the documents, the agent forms a hypothesis about how the question can be answered, that is, flight distance needs to be identified

Notice how the above reasoning resembles human thinking! But still, while these outputs are useful, we need to instruct the agent to use them as input for performing the next document retrieval. Without this, the answer provided in Generate answer is still not useful.

useful_documents are appended on each document retrieval loop, instead of being overwritten, to keep a memory of documents that were previously deemed useful. The hypothesis is formed from useful_documents and new_documents to provide an “abstract reasoning” that informs how the query is to be transformed subsequently.

The hypothesis is especially useful when no useful documents can be identified initially, as the agent can still form a hypothesis from documents that are not immediately deemed useful, or that only bear an indirect relationship to the question at hand, to inform what questions to ask next

3rd iteration: Brainstorm new questions to ask

Suggest questions for additional information retrieval

We have the agent reflect upon whether the answer is useful and grounded in context. If not, it should proceed to Transform query to ask further questions.

The output new_queries will be a list of new questions that the agent considers useful for obtaining extra information. Given the useful_documents (compensation policies for the EU) and the hypothesis (the flight distance needs to be identified), it asks questions like “What is the distance between Delhi and Munich?”

Now we are ready to use the new_queries for further retrieval!

The transform_query node will use useful_documents (which are accumulated per iteration, instead of being overwritten) and hypothesis as input for providing the agent directions to ask new questions.

The new questions will be a list of questions (instead of a single question) kept separate from the original question, so that the original question remains in state; otherwise, the agent could lose track of it after multiple iterations.

Final iteration: Further retrieval with new questions

Issuing new queries to retrieve extra documents

The output new_queries from Transform query will be passed to the Retrieve documents step, forming a retrieval loop.

Since the question “What is the distance between Delhi and Munich?” is asked, we can expect the flight distance to be retrieved as new_documents, subsequently graded into useful_documents, and then used as input for Generate answer.

The grade_documents node will compare the documents against both the original question and the new_queries list, so that documents considered useful for the new queries, even if not for the original question, are kept.

This is because those documents might help answer the original question indirectly, by being relevant to the new queries (like “What is the distance between Delhi and Munich?”)

Final answer!

Equipped with this new context about flight distance, the agent is now ready to provide the right answer: 600 EUR!

Next, let us now dive into the code to see how this multi-agent RAG application is created.

Implementation

The source code can be found here. Our multi-agent RAG application involves iterations and loops, and LangGraph is a great library for building such complex multi-agent application. If you are not familiar with LangGraph, you are strongly suggested to have a look at LangGraph’s Quickstart guide to understand more about it!

To keep this article concise, I will focus on the key code snippets only.

Important note: I am using OpenRouter as the LLM interface, but the code can be easily adapted for other LLM interfaces. Also, while in my code I am using Claude 3.5 Sonnet as the model, you can use any LLM as long as it supports tools as a parameter (check this list here), so you can also run this with other models, like DeepSeek V3 and OpenAI o1!
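As a minimal sketch of that setup (assuming the langchain-openai package; the exact configuration in the repository may differ), pointing a ChatOpenAI client at OpenRouter looks roughly like this:

import os
from langchain_openai import ChatOpenAI

# hypothetical configuration: swap the model id for any tool-calling model listed on OpenRouter
llm = ChatOpenAI(
    model="anthropic/claude-3.5-sonnet",
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)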

State definition

In the previous section, I defined various elements (e.g. new_documents, hypothesis) that are to be passed to each step (aka Node); in LangGraph’s terminology, these elements are called State.

We define the State formally with the following snippet.

from typing import List, Annotated
from typing_extensions import TypedDict

def append_to_list(original: list, new: list) -> list:
    original.append(new)
    return original

def combine_list(original: list, new: list) -> list:
    return original + new

class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        question: question
        generation: LLM generation
        new_documents: newly retrieved documents for the current iteration
        useful_documents: documents that are considered useful
        graded_documents: documents that have been graded
        new_queries: newly generated questions
        hypothesis: hypothesis
    """

    question: str
    generation: str
    new_documents: List[str]
    useful_documents: Annotated[List[str], combine_list]
    graded_documents: List[str]
    new_queries: Annotated[List[str], append_to_list]
    hypothesis: str

Graph definition

This is where we combine the different steps to form a “Graph”, which is a representation of our multi-agent application. The definitions of various steps (e.g. grade_documents) are represented by their respective functions.
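The node functions themselves live in the repository; as a rough sketch of the pattern (not the author’s exact implementation), each node receives the current GraphState and returns a dict containing only the state keys it updates, for example:

def retrieve(state: GraphState) -> dict:
    """Sketch of a retrieval node: fetch documents for the latest batch of queries."""
    # `retriever` is assumed to be a vector-store retriever created elsewhere
    queries = state["new_queries"][-1] if state.get("new_queries") else [state["question"]]
    docs = []
    for q in queries:
        docs.extend(retriever.invoke(q))
    return {"new_documents": [d.page_content for d in docs]}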

from langgraph.graph import END, StateGraph, START
from langgraph.checkpoint.memory import MemorySaver
from IPython.display import Image, display

workflow = StateGraph(GraphState)

# Define the nodes
workflow.add_node("retrieve", retrieve)  # retrieve
workflow.add_node("grade_documents", grade_documents)  # grade documents
workflow.add_node("generate", generate)  # generatae
workflow.add_node("transform_query", transform_query)  # transform_query

# Build graph
workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {
        "transform_query": "transform_query",
        "generate": "generate",
    },
)
workflow.add_edge("transform_query", "retrieve")
workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents_and_question,
    {
        "useful": END,
        "not supported": "transform_query",
        "not useful": "transform_query",
    },
)

# Compile
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)
display(Image(app.get_graph(xray=True).draw_mermaid_png()))

Running the above code, you should see this graphical representation of our RAG application. Notice how it is essentially equivalent to the graph shown in the final iteration of the “Multi-Agent Self-RAG” section!

Visualizing the multi-agent RAG graph

After generate, if the answer is considered “not supported”, the agent will proceed to transform_query instead of to generate again, so that the agent will look for additional information rather than trying to regenerate answers based on the existing context, which might not suffice for providing a “supported” answer

Now we are ready to put the multi-agent application to the test! With the below code snippet, we ask this question: how much can I receive if I am denied boarding, for flights from Delhi to Munich?

from pprint import pprint
from uuid import uuid4

config = {"configurable": {"thread_id": str(uuid4())}}

# Run
inputs = {
    "question": "how much can I receive if I am denied boarding, for flights from Delhi to Munich?",
    }
for output in app.stream(inputs, config):
    for key, value in output.items():
        # Node
        pprint(f"Node '{key}':")
        # Optional: print full state at each node
        # print(app.get_state(config).values)
    pprint("\n---\n")

# Final generation
pprint(value["generation"])

While output might vary (sometimes the application provides the answer without any iterations, because it “guessed” the distance between Delhi and Munich), it should look something like this, which shows the application went through multiple rounds of data retrieval for RAG.

---RETRIEVE---
"Node 'retrieve':"
'\n---\n'
---CHECK DOCUMENT RELEVANCE TO QUESTION---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---ASSESS GRADED DOCUMENTS---
---DECISION: GENERATE---
"Node 'grade_documents':"
'\n---\n'
---GENERATE---
---CHECK HALLUCINATIONS---
'---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---'
"Node 'generate':"
'\n---\n'
---TRANSFORM QUERY---
"Node 'transform_query':"
'\n---\n'
---RETRIEVE---
"Node 'retrieve':"
'\n---\n'
---CHECK DOCUMENT RELEVANCE TO QUESTION---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---ASSESS GRADED DOCUMENTS---
---DECISION: GENERATE---
"Node 'grade_documents':"
'\n---\n'
---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---
---GRADE GENERATION vs QUESTION---
---DECISION: GENERATION ADDRESSES QUESTION---
"Node 'generate':"
'\n---\n'
('Based on the context provided, the flight distance from Munich to Delhi is '
 '5,931 km, which falls into the long-distance category (over 3,500 km). '
 'Therefore, if you are denied boarding on a flight from Delhi to Munich '
 'operated by an EU airline, you would be eligible for 600 Euro compensation, '
 'provided that:\n'
 '1. The flight is operated by an EU airline\n'
 '2. There is no force majeure\n'
 '3. Other applicable conditions are met\n'
 '\n'
 "However, it's important to note that this compensation amount is only valid "
 'if all the required conditions are met as specified in the regulations.')

And the final answer is what we aimed for!

Based on the context provided, the flight distance from Munich to Delhi is
5,931 km, which falls into the long-distance category (over 3,500 km).
Therefore, if you are denied boarding on a flight from Delhi to Munich
operated by an EU airline, you would be eligible for 600 Euro compensation,
provided that:
1. The flight is operated by an EU airline
2. There is no force majeure
3. Other applicable conditions are met

However, it's important to note that this compensation amount is only valid
if all the required conditions are met as specified in the regulations.

Inspecting the State, we see how the hypothesis and new_queries enhance the effectiveness of our multi-agent RAG application by mimicking the human thinking process.

Hypothesis

print(app.get_state(config).values.get('hypothesis',""))
--- Output ---
To answer this question accurately, I need to determine:

1. Is this flight operated by an EU airline? (Since Delhi is non-EU and Munich is EU)
2. What is the flight distance between Delhi and Munich? (To determine compensation amount)
3. Are we dealing with a denied boarding situation due to overbooking? (As opposed to delay/cancellation)

From the context, I can find information about compensation amounts based on distance, but I need to verify:
- If the flight meets EU compensation eligibility criteria
- The exact distance between Delhi and Munich to determine which compensation tier applies (250€, 400€, or 600€)
- If denied boarding compensation follows the same amounts as delay compensation

The context doesn't explicitly state compensation amounts specifically for denied boarding, though it mentions overbooking situations in the EU require offering volunteers re-routing or refund options.

Would you like me to proceed with the information available, or would you need additional context about denied boarding compensation specifically?

New Queries

for questions_batch in app.get_state(config).values.get('new_queries',""):
    for q in questions_batch:
        print(q)
--- Output ---
What is the flight distance between Delhi and Munich?
Does EU denied boarding compensation follow the same amounts as flight delay compensation?
Are there specific compensation rules for denied boarding versus flight delays for flights from non-EU to EU destinations?
What are the compensation rules when flying with non-EU airlines from Delhi to Munich?
What are the specific conditions that qualify as denied boarding under EU regulations?

Conclusion

Simple RAG, while easy to build, might fall short in tackling real-life questions. By incorporating a human-like thinking process into a multi-agent RAG framework, we make RAG applications much more practical.

*Unless otherwise noted, all images are by the author


The post Supercharge Your RAG with Multi-Agent Self-RAG appeared first on Towards Data Science.
