Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o

Inside Deb8flow: Real-time AI debates with LangGraph and GPT-4o

Introduction

I’ve always been fascinated by debates—the strategic framing, the sharp retorts, and the carefully timed comebacks. Debates aren’t just entertaining; they’re structured battles of ideas, driven by logic and evidence. Recently, I started wondering: could we replicate that dynamic using AI agents—having them debate each other autonomously, complete with real-time fact-checking and moderation? The result was Deb8flow, an autonomous AI debating environment powered by LangGraph, OpenAI’s GPT-4o model, and the new integrated Web Search feature.

In Deb8flow, two agents—Pro and Con—square off on a given topic while a Moderator manages turn-taking. A dedicated Fact Checker reviews every claim in real time using GPT-4o’s new browsing capabilities, and a final Judge evaluates the arguments for quality and coherence. If an agent repeatedly makes factual errors, they’re automatically disqualified—ensuring the debate stays grounded in truth.

This article offers an in-depth look at the advanced architecture and dynamic workflows that power autonomous AI debates. I’ll walk you through how Deb8flow’s modular design leverages LangGraph’s state management and conditional routing, alongside GPT-4o’s capabilities.

Even if you’re new to AI agents or LangGraph (see resources [1] and [2] for primers), I’ll explain the key concepts clearly. And if you’d like to explore further, the full project is available on GitHub: iason-solomos/Deb8flow.

Ready to see how AI agents can debate autonomously in practice?

Let’s dive in.

High-Level Overview: Autonomous Debates with Multiple Agents

In Deb8flow, we orchestrate a formal debate between two AI agents – one arguing Pro and one Con – complete with a Moderator, a Fact Checker, and a final Judge. The debate unfolds autonomously, with each agent playing a role in a structured format.

At its core, Deb8flow is a LangGraph-powered agent system, built atop LangChain, using GPT-4o to power each role—Pro, Con, Judge, and beyond. We use GPT-4o’s preview model with browsing capabilities to enable real-time fact-checking. In essence, the Pro and Con agents debate; after each statement, a fact-checker agent uses GPT-4o’s web search to catch any hallucinations or inaccuracies in that statement in real time.​ The debate only continues once the statement is verified. The whole process is coordinated by a LangGraph-defined workflow that ensures proper turn-taking and conditional logic.


High-level debate flow graph. Each rectangle is an agent node (Pro/Con debaters, Fact Checker, Judge, etc.), and diamonds are control nodes (Moderator and a router after fact-checking). Solid arrows denote the normal progression, while dashed arrows indicate retries if a claim fails fact-check. The Judge node outputs the final verdict, then the workflow ends.
Image generated by the author with DALL-E

The debate workflow goes through these stages:

  • Topic Generation: A Topic Generator agent produces a nuanced, debatable topic for the session (e.g. “Should AI be used in classroom education?”).
  • Opening: The Pro Argument Agent makes an opening statement in favor of the topic, kicking off the debate.
  • Rebuttal: The Debate Moderator then gives the floor to the Con Argument agent, who rebuts the Pro’s opening statement.
  • Counter: The Moderator gives the floor back to the Pro agent, who counters the Con agent’s points.
  • Closing: The Moderator switches the floor to the Con agent one last time for a closing argument.
  • Judgment: Finally, the Judge agent reviews the full debate history and evaluates both sides based on argument quality, clarity, and persuasiveness. The most convincing side wins.

After every single speech, the Fact Checker agent steps in to verify the factual accuracy of that statement​. If a debater’s claim doesn’t hold up (e.g. cites a wrong statistic or “hallucinates” a fact), the workflow triggers a retry: the speaker has to correct or modify their statement. (If either debater accumulates 3 fact-check failures, they are automatically disqualified for repeatedly spreading inaccuracies, and their opponent wins by default.) This mechanism keeps our AI debaters honest and grounded in reality!

Prerequisites and Setup

Before diving into the code, make sure you have the following in place:

  • Python 3.12+ installed.
  • An OpenAI API key with access to the GPT-4o model. You can create your own API key here: https://platform.openai.com/settings/organization/api-keys
  • Project Code: Clone the Deb8flow repository from GitHub (git clone https://github.com/iason-solomos/Deb8flow.git). The repo includes a requirements.txt for all required packages. Key dependencies include LangChain/LangGraph (for building the agent graph) and the OpenAI Python client.
  • Install Dependencies: In your project directory, run: pip install -r requirements.txt to install the necessary libraries.
  • Create a .env file in the project root to hold your OpenAI API credentials. It should be of the form: OPENAI_API_KEY_GPT4O = "sk-…"
  • If you simply want to run the finished app, you can also check out the README at any time: https://github.com/iason-solomos/Deb8flow

Once dependencies are installed and the environment variable is set, you should be ready to run the app. The project structure is organized for clarity:

Deb8flow/
├── configurations/
│ ├── debate_constants.py
│ └── llm_config.py
├── nodes/
│ ├── base_component.py
│ ├── topic_generator_node.py
│ ├── pro_debater_node.py
│ ├── con_debater_node.py
│ ├── debate_moderator_node.py
│ ├── fact_checker_node.py
│ ├── fact_check_router_node.py
│ └── judge_node.py
├── prompts/
│ ├── topic_generator_prompts.py
│ ├── pro_debater_prompts.py
│ ├── con_debater_prompts.py
│ └── … (prompts for other agents)
├── tests/ (contains unit and whole workflow tests)
├── debate_state.py
└── debate_workflow.py

A quick tour of this structure:

configurations/ holds constant definitions and LLM configuration classes.

nodes/ contains the implementation of each agent or functional node in the debate (each of these is a module defining one agent’s behavior).

prompts/ stores the prompt templates for the language model (so each agent knows how to prompt GPT-4o for its specific task).

debate_workflow.py ties everything together by defining the LangGraph workflow (the graph of nodes and transitions).

debate_state.py defines the shared data structure that the agents will be using on each run.

tests/ includes some basic tests and example runs to help you verify everything is working.
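To actually launch a debate, you can drive the DebateWorkflow class (shown later in this article) from a small script. The repo's own entry point may differ, so treat the file name and wiring below as an assumption:

import asyncio

from debate_workflow import DebateWorkflow  # assumes debate_workflow.py exposes the class shown later


if __name__ == "__main__":
    # DebateWorkflow.run() is async, so we drive it with asyncio.
    final_state = asyncio.run(DebateWorkflow().run())

    # Print the transcript accumulated in the shared state.
    for message in final_state.get("messages", []):
        print(f"[{message['stage']}] {message['speaker']}: {message['content']}\n")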

Under the Hood: State Management and Workflow Setup

To coordinate a complex multi-turn debate, we need a shared state and a well-defined flow. We’ll start by looking at how Deb8flow defines the debate state and constants, and then see how the LangGraph workflow is constructed.

Defining the Debate State Schema (debate_state.py)

Deb8flow uses a shared state (see https://langchain-ai.github.io/langgraph/concepts/low_level/#state) in the form of a Python TypedDict that all agents can read from and update. This state tracks the debate’s progress and context – things like the topic, the history of messages, whose turn it is, etc. By centralizing this information, each agent node can make decisions based on the current state of the debate.

Link: debate_state.py

from typing import TypedDict, List, Dict, Literal


DebateStage = Literal["opening", "rebuttal", "counter", "final_argument"]

class DebateMessage(TypedDict):
    speaker: str  # e.g. pro or con
    content: str  # The message each speaker produced
    validated: bool  # Whether the FactChecker ok’d this message
    stage: DebateStage # The stage of the debate when this message was produced

class DebateState(TypedDict):
    debate_topic: str
    positions: Dict[str, str]
    messages: List[DebateMessage]
    opening_statement_pro_agent: str
    stage: str  # "opening", "rebuttal", "counter", "final_argument"
    speaker: str  # "pro" or "con"
    times_pro_fact_checked: int # The number of times the pro agent has been fact-checked. If it reaches 3, the pro agent is disqualified.
    times_con_fact_checked: int # The number of times the con agent has been fact-checked. If it reaches 3, the con agent is disqualified.

Key fields that we need to have in the DebateState include:

  • debate_topic (str): The topic being debated.
  • messages (List[DebateMessage]): A list of all messages exchanged so far. Each message is a dictionary with fields for speaker (e.g. "pro" or "con" or "fact_checker"), the message content (text), a validated flag (whether it passed fact-check), and the stage of the debate when it was produced.
  • stage (str): The current debate stage (one of "opening", "rebuttal", "counter", "final_argument").
  • speaker (str): Whose turn it is currently ("pro" or "con").
  • times_pro_fact_checked / times_con_fact_checked (int): Counters for how many times each side has been caught with a false claim. (In our rules, if a debater fails fact-check 3 times, they could be disqualified or automatically lose.)
  • positions (Dict[str, str]): (Optional) A mapping of each side’s general stance (e.g., "pro": "In favor of the topic").

By structuring the debate’s state this way, each agent can easily access the conversation history or check the current stage, and the control logic can update the state between turns. The state is essentially the memory of the debate.
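For concreteness, a snapshot of this state midway through a debate might look like the dictionary below (the values are illustrative, not taken from a real run):

example_state: DebateState = {
    "debate_topic": "Should AI be used in classroom education?",
    "positions": {"pro": "In favor of the topic", "con": "Against the topic"},
    "messages": [
        {
            "speaker": "pro",
            "content": "AI tutors can personalize learning for every student...",
            "validated": True,
            "stage": "opening",
        },
    ],
    "opening_statement_pro_agent": "AI tutors can personalize learning for every student...",
    "stage": "rebuttal",  # the Con agent speaks next
    "speaker": "con",
    "times_pro_fact_checked": 0,
    "times_con_fact_checked": 0,
}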

Constants and Configuration

To avoid “magic strings” scattered in the code, we define some constants in debate_constants.py. For example, constants for stage names (STAGE_OPENING = "opening", etc.), speaker identifiers (SPEAKER_PRO = "pro", SPEAKER_CON = "con", etc.), and node names (NODE_PRO_DEBATER = "pro_debater_node", etc.). These make the code easier to maintain and read.

debate_constants.py:

# Stage names
STAGE_OPENING = "opening"
STAGE_REBUTTAL = "rebuttal"
STAGE_COUNTER = "counter"
STAGE_FINAL_ARGUMENT = "final_argument"
STAGE_END = "end"

# Speakers
SPEAKER_PRO = "pro"
SPEAKER_CON = "con"
SPEAKER_JUDGE = "judge"

# Node names
NODE_PRO_DEBATER = "pro_debater_node"
NODE_CON_DEBATER = "con_debater_node"
NODE_DEBATE_MODERATOR = "debate_moderator_node"
NODE_JUDGE = "judge_node"

We also set up LLM configuration in llm_config.py. Here, we define classes for OpenAI or Azure OpenAI configs and then create a dictionary llm_config_map mapping model names to their config. For instance, we map "gpt-4o" to an OpenAILLMConfig that holds the model name and API key. This way, whenever we need to initialize a GPT-4o agent, we can just do llm_config_map["gpt-4o"] to get the right config. All our main agents (debaters, topic generator, judge) use this same GPT-4o configuration.

import os
from dataclasses import dataclass
from typing import Union

@dataclass
class OpenAILLMConfig:
    """
    A data class to store configuration details for OpenAI models.

    Attributes:
        model_name (str): The name of the OpenAI model to use.
        openai_api_key (str): The API key for authenticating with the OpenAI service.
    """
    model_name: str
    openai_api_key: str


llm_config_map = {
    "gpt-4o": OpenAILLMConfig(
        model_name="gpt-4o",
        openai_api_key=os.getenv("OPENAI_API_KEY_GPT4O"),
    )
}

Building the LangGraph Workflow (debate_workflow.py)

With state and configs in place, we construct the debate workflow graph. LangGraph’s StateGraph is the backbone that connects all our agent nodes in the order they should execute. Here’s how we set it up:

class DebateWorkflow:

    def _initialize_workflow(self) -> StateGraph:
        workflow = StateGraph(DebateState)
        # Nodes
        workflow.add_node("generate_topic_node", GenerateTopicNode(llm_config_map["gpt-4o"]))
        workflow.add_node("pro_debater_node", ProDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("con_debater_node", ConDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("fact_check_node", FactCheckNode())
        workflow.add_node("fact_check_router_node", FactCheckRouterNode())
        workflow.add_node("debate_moderator_node", DebateModeratorNode())
        workflow.add_node("judge_node", JudgeNode(llm_config_map["gpt-4o"]))

        # Entry point
        workflow.set_entry_point("generate_topic_node")

        # Flow
        workflow.add_edge("generate_topic_node", "pro_debater_node")
        workflow.add_edge("pro_debater_node", "fact_check_node")
        workflow.add_edge("con_debater_node", "fact_check_node")
        workflow.add_edge("fact_check_node", "fact_check_router_node")
        workflow.add_edge("judge_node", END)
        return workflow



    async def run(self):
        workflow = self._initialize_workflow()
        graph = workflow.compile()
        # graph.get_graph().draw_mermaid_png(output_file_path="workflow_graph.png")
        initial_state = {
            "debate_topic": "",
            "positions": {}
        }
        final_state = await graph.ainvoke(initial_state, config={"recursion_limit": 50})
        return final_state

Let’s break down what’s happening:

  • We initialize a new StateGraph with our DebateState type as the state schema.
  • We add each node (agent) to the graph with a name. For nodes that need an LLM, we pass in the GPT-4o config. For example, "pro_debater_node" is added as ProDebaterNode(llm_config_map["gpt-4o"]), meaning the Pro debater agent will use GPT-4o as its underlying model.
  • We set the entry point of the graph to "generate_topic_node". This means the first step of the workflow is to generate a debate topic.
  • Then we add directed edges to connect nodes. The edges above encode the primary sequence: topic -> pro’s turn -> fact-check -> (then a routing decision) -> … eventually -> judge -> END. We don’t connect the Moderator or Fact Check Router with static edges, since these nodes use dynamic commands to redirect the flow. The final edge connects the judge to an END marker to terminate the graph.

When the workflow runs, control will pass along these edges in order, but whenever we hit a router or moderator node, that node will output a command telling the graph which node to go to next (overriding the default edge). This is how we create conditional loops: the fact_check_router_node might send us back to a debater node for a retry, instead of following a straight line. LangGraph supports this by allowing nodes to return a special Command object with goto instructions.
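As a minimal illustration of that mechanism, a control node returns a Command that both patches the state and names the next node. This is a toy sketch only, and the import location of Command may vary slightly across LangGraph versions:

from langgraph.graph import END
from langgraph.types import Command  # location may differ by LangGraph version


def toy_router(state: DebateState) -> Command:
    """Toy example: hand control to the judge once the closing argument is done."""
    if state["stage"] == "final_argument" and state["speaker"] == "con":
        return Command(update={"stage": "end"}, goto="judge_node")
    # Fallback for the sketch: end the graph outright.
    return Command(goto=END)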

In summary, at a high level we’ve defined an agentic workflow: a graph of autonomous agents where control can branch and loop based on the agents’ outputs. Now, let’s explore what each of these agent nodes actually does.

Agent Nodes Breakdown

Each stage or role in the debate is encapsulated in a node (agent). In LangGraph, nodes are often simple functions, but I wanted a more object-oriented approach for clarity and reusability. So in Deb8flow, every node is a class with a __call__ method. All the main agent classes inherit from a common BaseComponent for shared functionality. This design makes the system modular: we can easily swap out or extend agents by modifying their class definitions, and each agent class is responsible for its piece of the workflow.

Let’s go through the key agents one by one.

BaseComponent – A Reusable Agent Base Class

Most of our agent nodes (like the debaters and judge) share common needs: they use an LLM to generate output, they might need to retry on errors, and they should track token usage. The BaseComponent class (defined in nodes/base_component.py: https://github.com/iason-solomos/Deb8flow/blob/main/nodes/base_component.py) provides these common features so we don’t repeat code.

class BaseComponent:
    """
    A foundational class for managing LLM-based workflows with token tracking.
    Can handle both Azure OpenAI (AzureChatOpenAI) and OpenAI (ChatOpenAI).
    """

    def __init__(
        self,
        llm_config: Optional[LLMConfig] = None,
        temperature: float = 0.0,
        max_retries: int = 5,
    ):
        """
        Initializes the BaseComponent with optional LLM configuration and temperature.

        Args:
            llm_config (Optional[LLMConfig]): Configuration for either Azure or OpenAI.
            temperature (float): Controls the randomness of LLM outputs. Defaults to 0.0.
            max_retries (int): How many times to retry on 429 errors.
        """
        logger = logging.getLogger(self.__class__.__name__)
        tracer = trace.get_tracer(__name__, tracer_provider=get_tracer_provider())

        self.logger = logger
        self.tracer = tracer
        self.llm: Optional[ChatOpenAI] = None
        self.output_parser: Optional[StrOutputParser] = None
        self.state: Optional[DebateState] = None
        self.prompt_template: Optional[ChatPromptTemplate] = None
        self.chain: Optional[RunnableSequence] = None
        self.documents: Optional[List] = None
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.max_retries = max_retries

        if llm_config is not None:
            self.llm = self._init_llm(llm_config, temperature)
            self.output_parser = StrOutputParser()

    def _init_llm(self, config: LLMConfig, temperature: float):
        """
        Initializes an LLM instance for either Azure OpenAI or standard OpenAI.
        """
        if isinstance(config, AzureOpenAILLMConfig):
            # If it's Azure, use the AzureChatOpenAI class
            return AzureChatOpenAI(
                deployment_name=config.deployment_name,
                azure_endpoint=config.azure_endpoint,
                openai_api_version=config.openai_api_version,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        elif isinstance(config, OpenAILLMConfig):
            # If it's standard OpenAI, use the ChatOpenAI class
            return ChatOpenAI(
                model_name=config.model_name,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        else:
            raise ValueError("Unsupported LLMConfig type.")

    def validate_initialization(self) -> None:
        """
        Ensures we have an LLM and an output parser.
        """
        if not self.llm:
            raise ValueError("LLM is not initialized. Ensure `llm_config` is provided.")
        if not self.output_parser:
            raise ValueError("Output parser is not initialized.")

    def execute_chain(self, inputs: Any) -> Any:
        """
        Executes the LLM chain, tracks token usage, and retries on 429 errors.
        """
        if not self.chain:
            raise ValueError("No chain is initialized for execution.")

        retry_wait = 1  # Initial wait time in seconds

        for attempt in range(self.max_retries):
            try:
                with get_openai_callback() as cb:
                    result = self.chain.invoke(inputs)
                    self.logger.info("Prompt Token usage: %s", cb.prompt_tokens)
                    self.logger.info("Completion Token usage: %s", cb.completion_tokens)
                    self.prompt_tokens = cb.prompt_tokens
                    self.completion_tokens = cb.completion_tokens

                return result

            except Exception as e:
                # If the error mentions 429, do exponential backoff and retry
                if "429" in str(e):
                    self.logger.warning(
                        f"Rate limit reached. Retrying in {retry_wait} seconds... "
                        f"(Attempt {attempt + 1}/{self.max_retries})"
                    )
                    time.sleep(retry_wait)
                    retry_wait *= 2
                else:
                    self.logger.error(f"Unexpected error: {str(e)}")
                    raise e

        raise Exception("API request failed after maximum number of retries")

    def create_chain(
        self, system_template: str, human_template: str
    ) -> RunnableSequence:
        """
        Creates a chain for unstructured outputs.
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm | self.output_parser
        return self.chain

    def create_structured_output_chain(
        self, system_template: str, human_template: str, output_model: Type[BaseModel]
    ) -> RunnableSequence:
        """
        Creates a chain that yields structured outputs (parsed into a Pydantic model).
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm.with_structured_output(output_model)
        return self.chain

    def build_return_with_tokens(self, node_specific_data: dict) -> dict:
        """
        Convenience method to add token usage info into the return values.
        """
        return {
            **node_specific_data,
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
        }

    def __call__(self, state: DebateState) -> None:
        """
        Updates the node's local copy of the state.
        """
        self.state = state
        for key, value in state.items():
            setattr(self, key, value)

Key features of BaseComponent:

  • It stores an LLM client (e.g. an OpenAI ChatOpenAI instance) initialized with a given model and API key, as well as an output parser.
  • It provides a method create_chain(system_template, human_template) which sets up a LangChain prompt chain (a RunnableSequence) combining a system prompt and a human prompt. This chain is what actually generates outputs when run.
  • It has an execute_chain(inputs) method that invokes the chain and includes logic to retry if the OpenAI API returns a rate-limit error (HTTP 429). This is done with exponential backoff up to a max_retries count.
  • It keeps track of token usage (prompt tokens and completion tokens) for logging or analysis.
  • The __call__ method of BaseComponent (which each subclass will call via super().__call__(state)) can perform any setup needed before the node’s main logic runs (like ensuring the LLM is initialized).

By building on BaseComponent, each agent class can focus on its unique logic (like what prompt to use and how to handle the state), while inheriting the heavy lifting of interacting with GPT-4o reliably.

Topic Generator Agent (GenerateTopicNode)

The Topic Generator (topic_generator_node.py) is the first agent in the graph. Its job is to come up with a debatable topic for the session. We give it a prompt that instructs it to output a nuanced topic that could reasonably have a pro and con side.

This agent inherits from BaseComponent and uses a prompt chain (system + human prompt) to generate one item of text – the debate topic. When called, it executes the chain (with no special input, just using the prompt) and gets back a topic_text. It then updates the state with:

  • debate_topic: the generated topic (stripped of any extra whitespace),
  • positions: a dictionary assigning the pro and con stances (by default we use "In favor of the topic" and "Against the topic"),
  • stage: set to "opening",
  • speaker: set to "pro" (so the Pro side will speak first).

In code, the return might look like:

return {
    "debate_topic": debate_topic,
    "positions": positions,
    "stage": "opening",
    "speaker": first_speaker  # "pro"
}

Here are the prompts for the topic generator:

SYSTEM_PROMPT = """\
You are a brainstorming AI that suggests debate topics.
You will provide a single, interesting or timely topic that can have two opposing views.
"""

HUMAN_PROMPT = """\
Please suggest one debate topic for two AI agents to discuss.
For example, it could be about technology, politics, philosophy, or any interesting domain.
Just provide the topic in a concise sentence.
"""

Then we pass these prompts in the constructor of the class itself.

class GenerateTopicNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        # Create the prompt chain.
        self.chain: RunnableSequence = self.create_chain(
            system_template=SYSTEM_PROMPT,
            human_template=HUMAN_PROMPT
        )

    def __call__(self, state: DebateState) -> Dict[str, str]:
        """
        Generates a debate topic and assigns positions to the two debaters.
        """
        super().__call__(state)

        topic_text = self.execute_chain({})

        # Store the topic and assign stances in the DebateState
        debate_topic = topic_text.strip()
        positions = {
            "pro": "In favor of the topic",
            "con": "Against the topic"
        }

        
        first_speaker = "pro"
        self.logger.info("Welcome to our debate panel! Today's debate topic is: %s", debate_topic)
        return {
            "debate_topic": debate_topic,
            "positions": positions,
            "stage": "opening",
            "speaker": first_speaker
        }

We will repeat this pattern for every LLM-backed agent class; the exceptions are the control nodes (Moderator and Fact Check Router), which don’t use an LLM, and the Fact Checker, which calls the OpenAI client directly instead of a prompt chain.

Now we can implement the two stars of the show: the Pro and Con argument agents!

Debater Agents (Pro and Con)

Link: pro_debater_node.py

The two debater agents are very similar in structure, but each uses different prompt templates tailored to their role (pro vs con) and the stage of the debate.

The Pro debater, for example, has to handle an opening statement and a counter-argument (countering the Con’s rebuttal). We also need logic for retries in case a statement fails fact-check. In code, the ProDebater class sets up multiple prompt chains:

  • opening_chain and an opening_retry_chain (using slightly different human prompts – the retry prompt might instruct it to try again without repeating any factually dubious claims).
  • counter_chain and counter_retry_chain for the counter-argument stage.

class ProDebaterNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        self.opening_chain = self.create_chain(SYSTEM_PROMPT, OPENING_HUMAN_PROMPT)
        self.opening_retry_chain = self.create_chain(SYSTEM_PROMPT, OPENING_RETRY_HUMAN_PROMPT)
        self.counter_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_HUMAN_PROMPT)
        self.counter_retry_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_RETRY_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)

        debate_topic = state.get("debate_topic")
        messages = state.get("messages", [])
        stage = state.get("stage")
        speaker = state.get("speaker")

        # Check if retrying (last message was by pro and not validated)
        last_msg = messages[-1] if messages else None
        retrying = last_msg and last_msg["speaker"] == SPEAKER_PRO and not last_msg["validated"]

        if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
            chain = self.opening_retry_chain if retrying else self.opening_chain  # pick the retry chain if the last Pro message failed fact-checking, otherwise the normal one
            result = chain.invoke({
                "debate_topic": debate_topic
            })
        elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
            opponent_msg = self._get_last_message_by(SPEAKER_CON, messages)
            debate_history = get_debate_history(messages)
            chain = self.counter_retry_chain if retrying else self.counter_chain
            result = chain.invoke({
                "debate_topic": debate_topic,
                "opponent_statement": opponent_msg,
                "debate_history": debate_history
            })
        else:
            raise ValueError(f"Unknown turn for ProDebater: stage={stage}, speaker={speaker}")
        new_message = create_debate_message(speaker=SPEAKER_PRO, content=result, stage=stage)
        self.logger.info("Speaker: %s, Stage: %s, Retry: %s\nMessage:\n%s", speaker, stage, retrying, result)
        return {
            "messages": messages + [new_message]
        }

    def _get_last_message_by(self, speaker_prefix, messages):
        for m in reversed(messages):
            if m.get("speaker") == speaker_prefix:
                return m["content"]
        return ""

When the ProDebater’s __call__ runs, it looks at the current stage and speaker in the state to decide what to do:

  • If it’s the opening stage and the speaker is “pro”, it uses the opening_chain to generate an opening argument. If the last message from Pro was marked invalid (not validated), it knows this is a retry, so it would use the opening_retry_chain instead.
  • If it’s the counter stage and speaker is “pro”, it generates a counter-argument to whatever the opponent (Con) just said. It will fetch the last message by the Con from the messages history, and feed that into the prompt (so that the Pro can directly counter it). Again, if the last Pro message was invalid, it would switch to the retry chain.

After generating its argument, the Debater agent creates a new message entry (with speaker="pro", the content text, validated=False initially, and the stage) and appends it to the state’s message list. That becomes the output of the node (LangGraph will merge this partial state update into the global state).
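Two small helpers appear in that snippet but aren’t reproduced in this article: create_debate_message and get_debate_history. Minimal versions consistent with the DebateMessage schema could look like this (the repo’s implementations may differ in detail):

def create_debate_message(speaker: str, content: str, stage: DebateStage) -> DebateMessage:
    """Wrap a speech in the DebateMessage structure; it starts unvalidated
    until the Fact Checker approves it."""
    return {
        "speaker": speaker,
        "content": content,
        "validated": False,
        "stage": stage,
    }


def get_debate_history(messages: List[DebateMessage]) -> str:
    """Render the messages so far as a plain-text transcript for the prompts."""
    return "\n\n".join(
        f"{m['speaker'].upper()} ({m['stage']}): {m['content']}" for m in messages
    )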

The Con Debater agent mirrors this logic for its stages:

  • It has a rebuttal and a closing (final argument) stage, each with a normal and a retry chain.
  • It checks whether it is the rebuttal stage or the final argument stage (with speaker "con") and invokes the appropriate chain, using the last Pro message for context when rebutting.
  • It similarly appends its message to the state.

Link: con_debater_node.py

By using class-based implementation, our debaters’ code is easier to maintain. We can clearly separate what the Pro does vs what the Con does, even if they share structure. Also, by encapsulating prompt chains inside the class, each debater can manage multiple possible outputs (regular vs retry) cleanly.

Prompt design: The actual prompts (in prompts/pro_debater_prompts.py and con_debater_prompts.py) guide the GPT-4o model to take on a persona (“You are a debater arguing for/against the topic…”) and produce the argument. They also instruct the model to keep statements factual and logical. If a fact check fails, the retry prompt may say something like: “Your previous statement had an unverified claim. Revise your argument to be factually correct while maintaining your position.” – encouraging the model to correct itself.
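As an illustration (paraphrasing, not the repo’s exact wording), the retry variant of the opening prompt in prompts/pro_debater_prompts.py could look something like this:

# Illustrative retry prompt; the wording in the repo's prompt files may differ.
OPENING_RETRY_HUMAN_PROMPT = """\
The debate topic is: {debate_topic}

Your previous opening statement contained a claim that failed fact-checking.
Rewrite your opening statement so that every factual claim is accurate and verifiable,
while still arguing firmly in favor of the topic.
"""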

With this, our AI debaters can engage in a multi-turn duel, and even recover from factual missteps.

Fact Checker Agent (FactCheckNode)

After each debater speaks, the Fact Checker agent swoops in to verify their claims. This agent is implemented in nodes/fact_checker_node.py (https://github.com/iason-solomos/Deb8flow/blob/main/nodes/fact_checker_node.py), and interestingly, it uses GPT-4o’s browsing ability rather than our own custom prompts. Essentially, we delegate the fact-checking to OpenAI’s GPT-4o search-preview model.

How does this work? The OpenAI Python client (with the search-preview model) lets us send a user message and get back a structured response. In FactCheckNode.__call__, we do something like:

completion = self.client.beta.chat.completions.parse(
    model="gpt-4o-search-preview",
    web_search_options={},
    messages=[{
        "role": "user",
        "content": (
            f"Consider the following statement from a debate. "
            f"If the statement contains numbers, or figures from studies, fact-check it online.\n\n"
            f"Statement:\n\"{claim}\"\n\n"
            f"Reply clearly whether any numbers or studies might be inaccurate or hallucinated, and why."
            f"\n"
            f"If the statement doesn't contain references to studies or numbers cited, don't go online to fact-check, and just consider it successfully fact-checked, with a 'yes' score.\n\n"
        )
    }],
    response_format=FactCheck
)

If the result is “yes” (meaning the claim seems truthful or at least not factually wrong), the Fact Checker will mark the last message’s validated field as True in the state, and output {"validated": True} with no further changes. This signals that the debate can continue normally.

If the result is “no” (meaning it found the claim to be incorrect or dubious), the Fact Checker will append a new message to the state with speaker="fact_checker" describing the finding (or we could simply mark it, but providing a brief note like “(Fact Checker: The statistic cited could not be verified.)” can be useful). It will also set validated: False and increment a counter for whichever side made the claim. The output state from this node includes validated: False and an updated times_pro_fact_checked or times_con_fact_checked count.

We also use a Pydantic BaseModel to control the output of the LLM:

class FactCheck(BaseModel):
    """
    Pydantic model for the fact checking the claims made by debaters.

    Attributes:
        binary_score (str): 'yes' if the claim is verifiable and truthful, 'no' otherwise.
    """

    binary_score: str = Field(
        description="Indicates if the claim is verifiable and truthful. 'yes' or 'no'."
    )
    justification: str = Field(
        description="Explanation of the reasoning behind the score."
    )
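Continuing from the completion call above, the tail of FactCheckNode.__call__ can then translate the parsed FactCheck result into a state update. The sketch below follows the behavior described in this section rather than the repo’s exact code (for brevity it only flags the failure instead of appending a separate fact-checker message):

# Sketch only: turn the structured fact-check result into a state update.
fact_check: FactCheck = completion.choices[0].message.parsed
last_message = state["messages"][-1]

if fact_check.binary_score.strip().lower() == "yes":
    last_message["validated"] = True
    return {"messages": state["messages"]}

# The claim failed: leave it unvalidated and bump the offender's counter.
last_message["validated"] = False
counter_key = (
    "times_pro_fact_checked"
    if last_message["speaker"] == SPEAKER_PRO
    else "times_con_fact_checked"
)
return {
    "messages": state["messages"],
    counter_key: state.get(counter_key, 0) + 1,
}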

Debate Moderator Agent (DebateModeratorNode)

The Debate Moderator is the conductor of the debate. Instead of producing lengthy text, this agent’s job is to manage turn-taking and stage progression. In the workflow, after a statement is validated by the Fact Checker, control passes to the Moderator node. The Moderator then issues a Command that updates the state for the next turn and directs the flow to the appropriate next agent.

The logic in DebateModeratorNode.__call__ (see nodes/debate_moderator_node.py: https://github.com/iason-solomos/Deb8flow/blob/main/nodes/debate_moderator_node.py) goes roughly like this:

if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_REBUTTAL, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_REBUTTAL and speaker == SPEAKER_CON:
    return Command(
        update={"stage": STAGE_COUNTER, "speaker": SPEAKER_PRO},
        goto=NODE_PRO_DEBATER
    )
elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_FINAL_ARGUMENT, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_FINAL_ARGUMENT and speaker == SPEAKER_CON:
    return Command(
        update={},
        goto=NODE_JUDGE
    )

raise ValueError(f"Unexpected stage/speaker combo: stage={stage}, speaker={speaker}")

Each conditional corresponds to a point in the debate where a turn just ended, and sets up the next turn. For example, after the opening (Pro just spoke), it sets stage to rebuttal, switches speaker to Con, and directs the workflow to the Con debater node​. After the final_argument (Con’s closing), it directs to the Judge with no further update (the debate stage effectively ends).

Fact Check Router (FactCheckRouterNode)

This is another control node (like the Moderator) that introduces conditional logic. The Fact Check Router sits right after the Fact Checker agent in the flow. Its purpose is to branch the workflow depending on the fact-check result.

In nodes/fact_check_router_node.py (https://github.com/iason-solomos/Deb8flow/blob/main/nodes/fact_check_router_node.py), the logic is:

if pro_fact_checks >= 3 or con_fact_checks >= 3:
    disqualified = SPEAKER_PRO if pro_fact_checks >= 3 else SPEAKER_CON
    winner = SPEAKER_CON if disqualified == SPEAKER_PRO else SPEAKER_PRO

    verdict_msg = {
        "speaker": "moderator",
        "content": (
            f"Debate ended early due to excessive factual inaccuracies.\n\n"
            f"DISQUALIFIED: {disqualified.upper()} (exceeded fact check limit)\n"
            f"WINNER: {winner.upper()}"
        ),
        "validated": True,
        "stage": "verdict"
    }
    return Command(
        update={"messages": messages + [verdict_msg]},
        goto=END
    )
if last_message.get("validated"):
    return Command(goto=NODE_DEBATE_MODERATOR)
elif speaker == SPEAKER_PRO:
    return Command(goto=NODE_PRO_DEBATER)
elif speaker == SPEAKER_CON:
    return Command(goto=NODE_CON_DEBATER)
raise ValueError("Unable to determine routing in FactCheckRouterNode.")

First, the Fact Check Router checks if either side’s fact-check count has reached 3. If so, it creates a Moderator-style message announcing an early end: the offending side is disqualified and the other side is the winner​. It appends this verdict to the messages and returns a Command that jumps to END, effectively terminating the debate without going to the Judge (because we already know the outcome).

If we’re not ending the debate early, it then looks at the Fact Checker’s result for the last message (which is stored as validated on that message). If validated is True, we go to the debate moderator: Command(goto=debate_moderator_node).

Else if the statement fails fact-check, the workflow goes back to the debater to produce a revised statement (with the state counters updated to reflect the failure). This loop can happen multiple times if needed (up to the disqualification limit).

This dynamic control is the heart of Deb8flow’s “agentic” nature – the ability to adapt the path of execution based on the content of the agents’ outputs. It showcases LangGraph’s strength: combining control flow with state. We’re essentially encoding debate rules (like allowing retries for false claims, or ending the debate if someone cheats too often) directly into the workflow graph.

Judge Agent (JudgeNode)

Last but not least, the Judge agent delivers the final verdict based on rhetorical skill, clarity, structure, and overall persuasiveness. Its system prompt and human prompt make this explicit:

  • System Prompt: “You are an impartial debate judge AI. … Evaluate which debater presented their case more clearly, persuasively, and logically. You must focus on communication skills, structure of argument, rhetorical strength, and overall coherence.”
  • Human Prompt: “Here is the full debate transcript. Please analyze the performance of both debaters—PRO and CON. Evaluate rhetorical performance—clarity, structure, persuasion, and relevance—and decide who presented their case more effectively.”

When the Judge node runs, it receives the entire debate transcript (all validated messages) alongside the original topic. It then uses GPT-4o to examine how each side framed their arguments, handled counterpoints, and supported (or failed to support) claims with examples or logic. Crucially, the Judge is forbidden to evaluate which position is objectively correct (or who it thinks might be correct)—only who argued more persuasively.
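The Judge itself follows the same BaseComponent pattern as the other LLM agents. Below is a condensed sketch; the prompt constant names, input variable names, and message handling are assumptions, and the repo’s judge_node.py is the authoritative version:

# Condensed sketch of the Judge agent (not the repo's exact code).
class JudgeNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.0):
        super().__init__(llm_config, temperature)
        self.chain = self.create_chain(SYSTEM_PROMPT, HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)
        # Feed the whole transcript (plus the topic) to GPT-4o for the verdict.
        verdict = self.execute_chain({
            "debate_topic": state.get("debate_topic"),
            "debate_history": get_debate_history(state.get("messages", [])),
        })
        self.logger.info("Final verdict:\n%s", verdict)
        new_message = create_debate_message(speaker=SPEAKER_JUDGE, content=verdict, stage="verdict")
        return {"messages": state.get("messages", []) + [new_message]}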

Below is an example final verdict from a Deb8flow run on the topic:
“Should governments implement a universal basic income in response to increasing automation in the workforce?”

WINNER: PRO

REASON: The PRO debater presented a more compelling and rhetorically effective case for universal basic income. Their arguments were well-structured, beginning with a clear statement of the issue and the necessity of UBI in response to automation. They effectively addressed potential counterarguments by highlighting the unprecedented speed and scope of current technological changes, which distinguishes the current situation from past technological shifts. The PRO also provided empirical evidence from UBI pilot programs to counter the CON's claims about work disincentives and economic inefficiencies, reinforcing their argument with real-world examples.

In contrast, the CON debater, while presenting valid concerns about UBI, relied heavily on historical analogies and assumptions about workforce adaptability without adequately addressing the unique challenges posed by modern automation. Their arguments about the fiscal burden and potential inefficiencies of UBI were less supported by specific evidence compared to the PRO's rebuttals.

Overall, the PRO's arguments were more coherent, persuasive, and backed by empirical evidence, making their case more convincing to a neutral observer.

LangSmith Tracing

Throughout Deb8flow’s development, I relied on LangSmith (LangChain’s tracing and observability toolkit) to ensure the entire debate pipeline was behaving correctly. Because we have multiple agents passing control between themselves, it’s easy for unexpected loops or misrouted states to occur. LangSmith provides a convenient way to:

  • Visualize Execution Flow: You can see each agent’s prompt, the tokens consumed (so you can also track costs), and any intermediate states. This makes it much simpler to confirm that, say, the Con Debater is properly referencing the Pro Debater’s last message, or that the Fact Checker is accurately receiving the claim to verify.
  • Debug State Updates: If the Moderator or Fact Check Router is sending the flow to the wrong node, the trace will highlight that mismatch. You can trace which agent was invoked at each step and why, helping you spot stage or speaker misalignments early.
  • Track Prompt and Completion Tokens: With multiple GPT-4o calls, it’s useful to see how many tokens each stage is using, which LangSmith logs automatically if you enable tracing.

Integrating LangSmith is unexpectedly easy. You just need to provide these three keys in your .env file: LANGCHAIN_API_KEY, LANGCHAIN_TRACING_V2, and LANGCHAIN_PROJECT.
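For example (the API key and project name below are placeholders):

LANGCHAIN_API_KEY="your-langsmith-api-key"
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT="deb8flow"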

Then you can open the LangSmith UI to see a structured trace of each run. This greatly reduces the guesswork involved in debugging multi-agent systems and is, in my experience, essential for more complex AI orchestration like ours. Example of a single run:

The trace of one run in waterfall mode in LangSmith, showing how the whole flow ran. Source: generated by the author using LangSmith.

Reflections and Next Steps

Building Deb8flow was an eye-opening exercise in orchestrating autonomous agent workflows. We didn’t just chain a single model call – we created an entire debate simulation with AI agents, each with a specific role, and allowed them to interact according to a set of rules. LangGraph provided a clear framework to define how data and control flows between agents, making the complex sequence manageable in code. By using class-based agents and a shared state, we maintained modularity and clarity, which will pay off for any software engineering project in the long run.

An exciting aspect of this project was seeing emergent behavior. Even though each agent follows a script (a prompt), the unscripted combination – a debater trying to deceive, a fact-checker catching it, the debater rephrasing – felt surprisingly realistic! It’s a small step toward more agentic AI systems that can perform non-trivial multi-step tasks while keeping oversight of each other.

There are plenty of ideas for improvement:

  • User Interaction: Currently it’s fully autonomous, but one could add a mode where a human provides the topic or even takes the role of one side against an AI opponent.
  • Switch the order in which the debaters speak.
  • Experiment with different prompts, which shape the agents’ behavior to a large degree.
  • Have the debaters perform a web search before producing their statements, so they argue with the latest information.

The broader implication of Deb8flow is how it showcases a pattern for composable AI agents. By defining clear boundaries and interactions (just like microservices in software), we can have complex AI-driven processes that remain interpretable and controllable. Each agent is like a cog in a machine, and LangGraph is the gear system making them work in unison.

I found this project energizing, and I hope it inspires you to explore multi-agent workflows. Whether it’s debating, collaborating on writing, or solving problems from different expert angles, the combination of GPT, tools, and structured agentic workflows opens up a new world of possibilities for AI development. Happy hacking!

References

[1] D. Bouchard, “From Basics to Advanced: Exploring LangGraph,” Medium, Nov. 22, 2023. [Online]. Available: https://medium.com/data-science/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787. [Accessed: Apr. 1, 2025].

[2] A. W. T. Ng, “Building a Research Agent that Can Write to Google Docs: Part 1,” Towards Data Science, Jan. 11, 2024. [Online]. Available: https://towardsdatascience.com/building-a-research-agent-that-can-write-to-google-docs-part-1-4b49ea05a292/. [Accessed: Apr. 1, 2025].

Developing an Autonomous Dual-Chatbot System for Research Paper Digesting

A project walk-through for the concept, implementation, and demonstration

Photo by Aaron Burden on Unsplash

As a researcher, reading and understanding scientific papers has always been a crucial part of my daily routine. I still remember the tricks I learned in grad school for digesting a paper efficiently. However, with countless research papers being published every day, I felt overwhelmed trying to keep up with the latest research trends and insights. The old tricks can only help so much.

Things started to change with the recent development of large language models (LLMs). Thanks to their remarkable contextual understanding capability, LLMs can fairly accurately identify relevant information in user-provided documents and generate high-quality answers to the user’s questions about those documents. A myriad of document Q&A tools have been developed based on this idea, and some are designed specifically to help researchers understand complex papers in a relatively short amount of time.

Although it’s definitely a step forward, I noticed some friction points when using those tools. One of the main issues I had was prompt engineering. Since the quality of LLM responses depends heavily on the quality of my questions, I often found myself spending quite some time crafting the "perfect" question. This is especially challenging when reading papers in unfamiliar research fields: oftentimes I simply don’t know what questions to ask.

This experience got me thinking: is it possible to develop a system that can automate the process of Q&A about research papers? A system that can distill key points from a paper more efficiently and autonomously?

Previously, I worked on a project where I developed a dual-chatbot system for language learning. The concept there was simple yet effective: by letting two chatbots chat in a user-specified foreign language, the user could learn the practical usage of the language by simply observing the conversation. The success of this project led me to an interesting thought: could a similar dual-chatbot system be useful for understanding research papers as well?

So, in this blog post, we are going to bring this idea to life. Specifically, we will walk through the process of developing a dual-chatbot system that can digest research papers in an autonomous manner.

To make this journey a fun experience, we are going to approach it as a software project and run a Sprint: we will begin with "ideation", where we introduce the concept of leveraging a dual-chatbot system to tackle our problem. Then comes the "Sprint execution", during which we’ll incrementally build the features of our design. Lastly, we will show our demo in the "Sprint review" and reflect on the learnings and future opportunities in the "Sprint Retrospective".

Ready to run the Sprint? Let’s get started!

This is the 2nd blog in my series of LLM projects. The 1st one is Building an AI-Powered Language Learning App, and the 3rd one is Training Soft Skills in Data Science with Real-Life Simulations. Feel free to check them out!

Table of Contents

  • 1. Concept: dual-chatbot system
  • 2. Sprint Planning: what we want to build
  • 3. Feature 1: Document Embedding Engine
  • 4. Feature 2: Dual-Chatbot System
      4.1 Abstract chatbot class
      4.2 Journalist chatbot class
      4.3 Author bot class
      4.4 Quick test: the interview
  • 5. Feature 3: User Interaction
      5.1 Creating the chat environment (in Jupyter Notebook)
      5.2 Implementing PDF highlighting functionality
      5.3 Allowing user input for questions
      5.4 Allowing downloading the generated script
  • 6. Sprint Review: show the demo!
  • 7. Sprint Retrospective


1. Concept: dual-chatbot system

The foundation of our solution lies in the concept of a dual-chatbot system. As its name implies, this system involves two chatbots (powered by large language models) engaging in an autonomous dialogue. By specifying a high-level task description and assigning relevant roles to the chatbots, users can guide the conversation toward their desired direction.

To give a concrete example: in my previous project where a dual-chatbot is developed for assisting language learning, the learner (user) can specify a real-life scenario (e.g., dining at a restaurant) and assign roles for chatbots to play (e.g., bot 1 as the waitstaff and bot 2 as the customer), the two bots would then simulate a conversation in the user’s chosen foreign language, mimicking the interaction between the assigned roles in the given scenario. This allows an on-demand generation of fresh, scenario-specific language learning materials, therefore helping users better understand language usage in real-life situations.

So, how do we adapt this concept for the autonomous digestion of research papers?

The key lies in the role assignment. More specifically, one bot could take the role of a "journalist", whose main task is to conduct an interview to understand and extract key insights from a research paper. Meanwhile, the other bot could play the role of an "author", who has full access to the research paper and is tasked with providing comprehensive answers to the "journalist" bot’s queries.

When it comes to interaction, the journalist bot initiates the dialogue and kicks off the interview process. The author bot then serves as a conventional document Q&A engine and answers the journalist’s questions based on the relevant context of the research paper. The journalist bot follows up with additional questions for further clarification. Through this iterative Q&A process, the key contributions, methodology, and findings of the research paper can be automatically extracted.

An illustration of the workflow of the dual-chatbot system. (Image by author)

This dual-chatbot system described above introduces a shift from the traditional user-chatbot interaction: instead of users thinking about the right questions to ask the LLM model, the introduced "journalist" bot will automatically come up with suitable questions on the user’s behalf. This approach could bypass the need for users to craft appropriate prompts, thus significantly reducing the users’ cognitive load. This is especially useful when delving into unfamiliar research fields. Overall, the dual-chatbot system may constitute a more user-friendly, efficient, and engaging method for distilling complex scientific research papers.

Next up, let’s move to Sprint planning and define several user stories we would like to address in this project.


2. Sprint Planning: what we want to build

With the concept in place, it’s time to plan our current Sprint. In line with the common practice of Agile development, our Sprint planning will revolve around user stories.

In Agile development, a user story is a concise, informal, and simple description of a feature or functionality from an end-user perspective. It is a common practice used in Agile development to define and communicate requirements in a way that is understandable and actionable for the development team.

  • 🎯 User story 1: document embedding

"As a user, I want to input research papers in PDF format into the system, and I want the system to convert my input paper into a machine-readable format so that the dual-chatbot system can understand and analyze it efficiently." (Generated by GPT-4)

This user story focuses on data ingestion. Essentially, we need to build a data-processing pipeline that includes document loading, splitting, embedding creation, and embedding storage.

Here, "embeddings" refer to the numerical representations of the text data. By creating a numerical representation of each part of a research paper, the author bot can better understand the semantic meaning of the research paper and be able to accurately answer the journalist bot’s questions.

Additionally, we need to have a database to store the computed embeddings of the research paper. This database needs to be readily accessible by the author bot to facilitate fast and accurate answer generation.

In section 3, we will address this user story by leveraging the OpenAI Embeddings API along with Meta’s FAISS vector store.
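As a preview of what that pipeline involves, the core steps with LangChain typically look like the generic sketch below; the project’s actual Embedder class in section 3 is more elaborate, and the file path is just a placeholder.

# Generic sketch of a PDF -> chunks -> embeddings -> FAISS pipeline.
# Assumes OPENAI_API_KEY is set in the environment; "paper.pdf" is a placeholder path.
from langchain.document_loaders import PyMuPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

docs = PyMuPDFLoader("paper.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())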

  • 🎯 User story 2: dual-chatbot

"As a user, I want to observe an autonomous conversation between two chatbots – one playing the role of a ‘journalist’ asking questions and the other playing the role of an ‘author’ answering them, derived from the contents of the research paper. This will help me understand the paper’s key points without needing to read it in its entirety or craft my own questions." (Generated by GPT-4)

This user story represents the cornerstone of our project: the development of the dual-chatbot system. As discussed in the "Concept" section, we need to construct two types of chatbot classes: one that is able to develop a series of questions to query the details of the paper (i.e., the journalist bot), and another that can leverage document embeddings to generate comprehensive answers to these questions (i.e., the author bot).

In section 4, we will focus on addressing this user story by using the Langchain framework.

  • 🎯 User story 3: chat environment

"As a user, I want an intuitive chat interface where I can watch the chatbots’ conversation unfold in real-time." (Generated by GPT-4)

The goal of this user story is to build a chat environment where users can view the generated dialogue between the journalist and author bots. In the spirit of MVP (minimum viable product), we will use simple Jupyter widgets to demonstrate the chat environment in section 5.1.

  • 🎯 User story 4: PDF highlighting

"As a user, I want to have the corresponding parts in the original research paper highlighted based on the chatbot’s discussion. This will help me to quickly locate the sources of the information discussed during the conversation." (Generated by GPT-4)

This user story focuses on providing users with the traceability of the Q&A. For every answer generated by the author bot, it is natural for users to want to know precisely where in the research paper the discussed information originates. Not only does this feature enhance the transparency of our dual-chatbot system, but it also allows for a more interactive and engaging user experience.

In section 5.2, we will leverage LangChain’s conversational retrieval chain to return the sources the author bot used to generate the answers and the PyMuPDF library to highlight the corresponding texts in the original PDF.

  • 🎯 User story 5: user input

"As a user, I want to be able to intervene and ask my own questions in the midst of the chatbot’s conversation, this way I can direct the conversation and extract the information I need from the paper." (Generated by GPT-4)

This user story focuses on the need for user participation. While our target dual-chatbot system is designed to be autonomous, we also need to provide the option for users to ask their own questions. This feature ensures that the conversation does not just go in a direction set by the bots, but it can be guided by the user’s own curiosity and interests. Also, it is very likely that users may get inspired by watching the first rounds of conversation, and would like to ask follow-up questions or dig deeper into certain aspects that are of particular interest to them. All these underline the importance of user intervention.

In section 5.3, we will address this user story by upgrading our user interface in Jupyter Notebook.

  • 🎯 User story 6: download scripts

"As a user, I want to be able to download a transcript of the chatbot conversation. This will allow me to review the key points offline or share the information with my colleagues." (Generated by GPT-4)

This user story focuses on the accessibility and shareability of the generated content. Although users can view the conversation in a dedicated chat environment, it is beneficial to provide users with a record of the discussion that they can review later and share with others.

In section 5.4, we will use the PDFDocument library to convert the generated script into a PDF file for users to download.

So much for the planning, time to get to work!

Our planned user stories. (Image by author)

3. Feature 1: Document Embedding Engine

Let’s implement the first feature of our paper digesting app: the document embedding engine. Here, we will build a data-processing class with the functionality of document loading, splitting, embedding creation, and storage. This addresses our first user story:

"As a user, I want to input research papers in PDF format into the system, and I want the system to convert my input paper into a machine-readable format so that the dual-chatbot system can understand and analyze it efficiently." (Generated by GPT-4)

We start by creating an embedding_engine.py file and importing the necessary libraries:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.utilities import ArxivAPIWrapper
import os

We then instantiate an embedding model using the OpenAI Embeddings API:

class Embedder:
    """Embedding engine to create doc embeddings."""

    def __init__(self, engine='OpenAI'):
        """Specify embedding model.

        Args:
        --------------
        engine: the embedding model. 
                For a complete list of supported embedding models in LangChain, 
                see https://python.langchain.com/docs/integrations/text_embedding/
        """
        if engine == 'OpenAI':
            # Reminder: need to set up openAI API key 
            # (e.g., via environment variable OPENAI_API_KEY)
            self.embeddings = OpenAIEmbeddings()

        else:
            raise KeyError("Currently unsupported chat model type!")

Next, we define the function for loading and processing PDF files:

def load_n_process_document(self, path):
    """Load and process PDF document.

    Args:
    --------------
    path: path of the paper.
    """

    # Load PDF
    loader = PyMuPDFLoader(path)
    documents = loader.load()

    # Process PDF
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    self.documents = text_splitter.split_documents(documents)

Here, we have used PyMuPDFLoader to load the PDF file, which, under the hood, leverages the PyMuPDF library to parse the PDF file. The returned documents variable is a list of LangChain Document() objects. Each Document() object corresponds to one page of the original PDF, with the page content stored in the page_content key and associated metadata (e.g., page number, etc.) stored in the metadata key.

After parsing the loaded PDF, we used RecursiveCharacterTextSplitter from LangChain to split the original PDF into multiple smaller chunks. Since the author bot will later use relevant texts from the PDF to answer questions, creating small chunks of text not only helps the author bot focus on the specific details needed to answer a question, but also ensures that the context provided to the author bot will not exceed the token limit of the employed LLM.
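To make this concrete, here is a minimal standalone sketch of what the loader and splitter return; the file name "paper.pdf" is a placeholder:

from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load one Document per page, then split into overlapping chunks
docs = PyMuPDFLoader("paper.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

print(len(docs), "pages ->", len(chunks), "chunks")
print(chunks[0].page_content[:200])  # first 200 characters of the first chunk
print(chunks[0].metadata)            # metadata such as the source file and page number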

Next, we set up the vector store to manage the text embedding vectors:

def create_vectorstore(self, store_path):
    """Create vector store for doc Q&amp;A.
       For a complete list of vector stores supported by LangChain,
       see: https://python.langchain.com/docs/integrations/vectorstores/

    Args:
    --------------
    store_path: path of the vector store.

    Outputs:
    --------------
    vectorstore: the created vector store for holding embeddings
    """
    if not os.path.exists(store_path):
        print("Embeddings not found! Creating new ones")
        self.vectorstore = FAISS.from_documents(self.documents, self.embeddings)
        self.vectorstore.save_local(store_path)

    else:
        print("Embeddings found! Loaded the computed ones")
        self.vectorstore = FAISS.load_local(store_path, self.embeddings)

    return self.vectorstore

Here, we used Facebook AI Similarity Search (FAISS) library to serve as our vector store, which takes the loaded PDF and the embedding engine as the inputs to its constructor. The created self.vectorstore holds the embedding vectors of individual PDF chunks we created earlier. At query time, it will invoke the embedding engine to embed the question and then retrieve the embedding vectors that are ‘most similar’ to the embedded query. The texts that correspond to the most similar embedding vectors will be fed to the author bot as the context to assist its answer generation. This process is known as vector search and forms the backbone for document Q&A.
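As a rough illustration, the vector search step can also be exercised directly on the store returned by create_vectorstore(); the query below is only an example:

# Retrieve the three chunks most similar to an example query
query = "What training strategy does the paper propose?"
relevant_chunks = vectorstore.similarity_search(query, k=3)

for chunk in relevant_chunks:
    print(f"Page {chunk.metadata['page'] + 1}: {chunk.page_content[:120]}...")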

Finally, we create a helper function to generate a short summary of the paper. This will be useful later for setting the stage for the journalist bot.

def create_summary(self, llm_engine=None):
    """Create paper summary. 
    The summary is created by using LangChain's summarize_chain.

    Args:
    --------------
    llm_engine: backbone large language model.

    Outputs:
    --------------
    summary: the summary of the paper
    """

    if llm_engine is None:
        raise KeyError("please specify a LLM engine to perform summarization.")

    elif llm_engine == 'OpenAI':
        # Reminder: need to set up openAI API key 
        # (e.g., via environment variable OPENAI_API_KEY)
        llm = ChatOpenAI(
            model_name="gpt-3.5-turbo",
            temperature=0.8
        )

    else:
        raise KeyError("Currently unsupported chat model type!")

    # Use LLM to summarize the paper
    chain = load_summarize_chain(llm, chain_type="stuff")
    summary = chain.run(self.documents[:20])

    return summary

We resort to an LLM to create the summary. Technically speaking, we can achieve that goal by using LangChain’s load_summarize_chain, which takes the LLM and the summarization method as inputs.

In terms of the summarization method, we have used the stuff method here, which simply "stuffs" all the documents into a single context and prompts the LLM to generate the summary. For other, more advanced methods, please refer to the official LangChain documentation.
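For reference, switching to one of those methods only requires changing the chain_type argument. The sketch below (not used in this project) shows the "map_reduce" variant, which summarizes the chunks individually and then merges the partial summaries, useful when the paper would not fit into a single context window:

# Hypothetical alternative inside create_summary(): map-reduce summarization
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.run(self.documents)  # all chunks, not just the first 20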

Great! Now that we have developed the Embedder class to handle the document loading, splitting, as well as embedding creation and storage, we can move on to the core of our app: the dual-chatbot system.


4. Feature 2: Dual-Chatbot System

In this section, we address our second user story:

"As a user, I want to observe an autonomous conversation between two chatbots – one playing the role of a ‘journalist’ asking questions and the other playing the role of an ‘author’ answering them, derived from the contents of the research paper. This will help me understand the paper’s key points without needing to read it in its entirety or craft my own questions." (Generated by GPT-4)

We will start by creating an abstract base class for defining the common behaviors of the chatbots. Afterward, we will develop the individual journalist bot and the author bot that inherit from the chatbot base class. We put all the class definitions in chatbot.py.

4.1 Abstract chatbot class

Since our journalist bot and the author bot share a lot of similarities (as they are all role-playing bots), it is a good practice to encapsulate the definition of their shared behaviors within an abstract base class:

from abc import ABC, abstractmethod
from langchain.chat_models import ChatOpenAI

class Chatbot(ABC):
    """Class definition for a single chatbot with memory, created with LangChain."""

    def __init__(self, engine):
        """Initialize the large language model and its associated memory.
        The memory can be a LangChain memory object, or a list of chat history.

        Args:
        --------------
        engine: the backbone llm-based chat model.
        """

        # Instantiate llm
        if engine == 'OpenAI':
            # Reminder: need to set up openAI API key 
            # (e.g., via environment variable OPENAI_API_KEY)
            self.llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.8)

        else:
            raise KeyError("Currently unsupported chat model type!")

    @abstractmethod
    def instruct(self):
        """Determine the context of chatbot interaction. 
        """
        pass

    @abstractmethod
    def step(self):
        """Action produced by the chatbot. 
        """
        pass

    @abstractmethod
    def _specify_system_message(self):
        """Prompt engineering for chatbot.
        """       
        pass

We defined three common methods:

  • instruct: this method is used to set up the chatbot and attach memory to it.
  • step: this method is used to feed input to the chatbot and receive the bot’s response.
  • _specify_system_message: this method is used to give the chatbot specific instructions regarding how it should behave during the conversation.

With the chatbot template in place, we are ready to create two specific chatbot roles, i.e., the journalist bot and the author bot.

4.2 Journalist chatbot class

The journalist bot’s role is to interview the author bot and extract key insights from a research paper. With that in mind, let’s fill the template methods with concrete code.

from langchain.memory import ConversationBufferMemory

class JournalistBot(Chatbot):
    """Class definition for the journalist bot, created with LangChain."""

    def __init__(self, engine):
        """Setup journalist bot.

        Args:
        --------------
        engine: the backbone llm-based chat model.
        """

        # Instantiate llm
        super().__init__(engine)

        # Instantiate memory
        self.memory = ConversationBufferMemory(return_messages=True)

In the constructor method, besides specifying a backbone LLM, another important component for the journalist bot is the memory object. Memory tracks the conversation history and serves as the key to helping the journalist bot avoid repetitive or irrelevant questions and generate meaningful follow-up questions. Technically, we achieved that by using the ConversationBufferMemory provided by LangChain, which simply prepends the last few inputs/outputs to the current input of the chatbot.
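To illustrate what the buffer memory holds, here is a small standalone sketch (the question and answer texts are made up):

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages=True)
memory.save_context({"input": "What problem does the paper address?"},
                    {"output": "It targets unstable training of physics-informed neural networks."})

# The stored history is a list of message objects that gets prepended to the next prompt
print(memory.load_memory_variables({})["history"])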

Next, we set up the journalist chatbot by creating a ConversationChain, with the previously defined backbone LLM, the memory object, as well as the prompt for the chatbot. Note that we have also specified topic (the paper topic) and abstract (the paper summary), which will be used later to provide the context of the paper to the journalist bot.

from langchain.chains import ConversationChain
from langchain.prompts import (
    ChatPromptTemplate, 
    MessagesPlaceholder, 
    SystemMessagePromptTemplate, 
    HumanMessagePromptTemplate
)

def instruct(self, topic, abstract):
    """Determine the context of journalist chatbot. 

    Args:
    ------
    topic: the topic of the paper
    abstract: the abstract of the paper
    """

    self.topic = topic
    self.abstract = abstract

    # Define prompt template
    prompt = ChatPromptTemplate.from_messages([
        SystemMessagePromptTemplate.from_template(self._specify_system_message()),
        MessagesPlaceholder(variable_name="history"),
        HumanMessagePromptTemplate.from_template("""{input}""")
    ])

    # Create conversation chain
    self.conversation = ConversationChain(memory=self.memory, prompt=prompt, 
                                          llm=self.llm, verbose=False)

In LangChain, the prompt generation and ingestion for instructing the chatbot are handled via different prompt templates. For our current application, the most critical piece is setting the SystemMessagePromptTemplate, as it allows us to give a high-level purpose to the journalist bot and also define its desired behaviors.

The following are the details of the instruction. Note that the instruction/prompt is generated and optimized with ChatGPT (GPT-4). This is beneficial because LLM-generated prompts tend to capture more nuances than human-crafted ones. Additionally, generating high-level instructions with an LLM represents a more scalable solution for adapting the system to other scenarios beyond "journalist-author" interactions.

def _specify_system_message(self):
    """Specify the behavior of the journalist chatbot.
    The prompt is generated and optimized with GPT-4.

    Outputs:
    --------
    prompt: instructions for the chatbot.
    """       

    prompt = f"""You are a technical journalist interested in {self.topic}, 
    Your task is to distill a recently published scientific paper on this topic through
    an interview with the author, which is played by another chatbot.
    Your objective is to ask comprehensive and technical questions 
    so that anyone who reads the interview can understand the paper's main ideas and contributions, 
    even without reading the paper itself. 
    You're provided with the paper's summary to guide your initial questions.
    You must keep the following guidelines in mind:
    - Focus exclusively on the technical content of the paper.
    - Avoid general questions about {self.topic}, focusing instead on specifics related to the paper.
    - Only ask one question at a time.
    - Feel free to ask about the study's purpose, methods, results, and significance, 
    and clarify any technical terms or complex concepts. 
    - Your goal is to lead the conversation towards a clear and engaging summary.
    - Do not include any prefixed labels like "Interviewer:" or "Question:" in your question.

    [Abstract]: {self.abstract}"""

    return prompt

Here, we provided the journalist bot with the paper’s research domain and abstract to serve as the base for initial questions. This mirrors the real-world scenario where a journalist initially only knows a little about the paper and needs to ask questions to gather more information.

Finally, we need a step method to interact with the journalist bot:

def step(self, prompt):
    """Journalist chatbot asks question. 

    Args:
    ------
    prompt: Previous answer provided by the author bot.
    """
    response = self.conversation.predict(input=prompt)

    return response

In this case, the input prompt will be the author bot’s answer to the journalist bot’s previous question. If the conversation has not started yet, the input prompt will simply be "Start the conversation", to prompt the journalist bot to start the interview.

That’s it for the journalist bot. Let’s now turn to the author bot.

4.3 Author bot class

The author bot’s role is to answer questions raised by the journalist bot based on the research paper. Here is the constructor method for the author bot:

class AuthorBot(Chatbot):
    """Class definition for the author bot, created with LangChain."""

    def __init__(self, engine, vectorstore, debug=False):
        """Select backbone large language model, as well as instantiate 
        the memory for creating language chain in LangChain.

        Args:
        --------------
        engine: the backbone llm-based chat model.
        vectorstore: embedding vectors of the paper.
        """

        # Instantiate llm
        super().__init__(engine)

        # Instantiate memory
        self.chat_history = []

        # Instantiate embedding index
        self.vectorstore = vectorstore

        self.debug = debug

Two things have changed here. First, unlike the journalist bot, the author bot should be able to access the full paper. Therefore, the vector store we created earlier needs to be passed to the constructor. Also, note that we are no longer using a memory object (e.g., ConversationBufferMemory) to track the chat history. Instead, we simply use a list to store the history and later pass it explicitly to the author bot. Each element of the list is a tuple of (query, answer). Both ways of maintaining conversation history are supported in LangChain.
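For clarity, the sketch below shows the shape of that explicit history (the Q&A texts are made up); it is this list that gets passed to the retrieval chain on every call:

# Explicit chat history: a plain list of (question, answer) tuples
chat_history = [
    ("What problem does the paper address?",
     "It targets unstable training of physics-informed neural networks."),
]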

Next, we set up the conversation chain for the author bot.

from langchain.chains import ConversationalRetrievalChain

def instruct(self, topic):
    """Determine the context of author chatbot. 

    Args:
    -------
    topic: the topic of the paper.
    """

    # Specify topic
    self.topic = topic

    # Define prompt template
    qa_prompt = ChatPromptTemplate.from_messages([
        SystemMessagePromptTemplate.from_template(self._specify_system_message()),
        HumanMessagePromptTemplate.from_template("{question}")
    ])

    # Create conversation chain
    self.conversation_qa = ConversationalRetrievalChain.from_llm(llm=self.llm, verbose=self.debug,
                                                                 retriever=self.vectorstore.as_retriever(
                                                                     search_kwargs={"k": 5}),
                                                                 return_source_documents=True,
                                                                combine_docs_chain_kwargs={'prompt': qa_prompt})

Since the author bot needs to answer questions by first retrieving relevant context, we adopted a ConversationalRetrievalChain. To quote from the official document of LangChain:

ConversationalRetrievalChain first combines the chat history (either explicitly passed in or retrieved from the provided memory) and the query into a standalone question, then looks up relevant documents from the retriever, and finally passes those documents and the query to a question answering chain to return a response.

Therefore, in addition to the backbone LLM, we also need to supply the chain with a vector store. Note that here we specified the number of returned relevant documents (PDF chunks) via search_kwargs. In general, selecting the right number is not a trivial task and deserves careful consideration of balancing accuracy, relevance, comprehensiveness, and computational resources. Lastly, we set return_source_documents to True, which is important for ensuring transparency and traceability in the Q&A process.
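As a quick sanity check, the configured retriever can also be exercised on its own. The sketch below assumes the vector store created in Section 3 and uses an illustrative query:

# Fetch the top-5 chunks for an example question
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("How are the ensemble members trained?")

print(len(docs))                     # -> 5
print(docs[0].metadata["page"] + 1)  # page number of the best match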

To interact with the author bot:

def step(self, prompt):
    """Author chatbot answers question. 

    Args:
    ------
    prompt: question raised by journalist bot.

    Outputs:
    ------
    answer: the author bot's answer
    source_documents: documents that author bot used to answer questions
    """
    response = self.conversation_qa({"question": prompt, "chat_history": self.chat_history})
    self.chat_history.append((prompt, response["answer"]))

    return response["answer"], response["source_documents"]

As discussed previously, we explicitly supplied the chat history (a list of previous query-answer tuples) to the conversation chain. As a result, we also need to manually append the newly obtained query-answer tuple to the chat history. For the response, we get not only the answer but also the source documents (PDF chunks) used by the author bot to generate the answer, which will be used later to highlight the corresponding texts in PDF.

Finally, we inform the author bot of its role and specify detailed instructions. As with the journalist bot, the instruction/prompt for the author bot is also generated and optimized with ChatGPT (GPT-4).

def _specify_system_message(self):
    """Specify the behavior of the author chatbot.
    The prompt is generated and optimized by GPT-4.

    Outputs:
    --------
    prompt: instructions for the chatbot.
    """       

    prompt = f"""You are the author of a recently published scientific paper on {self.topic}.
    You are being interviewed by a technical journalist who is played by another chatbot and
    looking to write an article to summarize your paper.
    Your task is to provide comprehensive, clear, and accurate answers to the journalist's questions.
    Please keep the following guidelines in mind:
    - Try to explain complex concepts and technical terms in an understandable way, without sacrificing accuracy.
    - Your responses should primarily come from the relevant content of this paper, 
    which will be provided to you in the following, but you can also use your broad knowledge in {self.topic} to 
    provide context or clarify complex topics. 
    - Remember to differentiate when you are providing information directly from the paper versus 
    when you're giving additional context or interpretation. Use phrases like 'According to the paper...' for direct information, 
    and 'Based on general knowledge in the field...' when you're providing additional context.
    - Only answer one question at a time. Ensure that each answer is complete before moving on to the next question.
    - Do not include any prefixed labels like "Author:", "Interviewee:", "Respond:", or "Answer:" in your answer.
    """

    prompt += """Given the following context, please answer the question.

    {context}"""

    return prompt

That’s it for constructing the author bot.

4.4 Quick test: the interview

Time to take two bots for a ride!

To see if the developed journalist and author bot can engage in meaningful conversation toward the goal of digesting the paper, we pick one sample scientific research paper and run the test.

As I have been working on physics-informed machine learning recently, I picked an arXiv paper titled "Improved Training of Physics-Informed Neural Networks with Model Ensembles" (CC BY 4.0 license) for the test.

paper = 'Improved Training of Physics-Informed Neural Networks with Model Ensembles'

# Create embeddings
embedding = Embedder(engine='OpenAI')
embedding.load_n_process_document("../Papers/"+paper+".pdf")

# Set up vectorstore
vectorstore = embedding.create_vectorstore(store_path=paper)

# Fetch paper summary
paper_summary = embedding.create_summary(llm_engine='OpenAI')

# Instantiate journalist and author bot
journalist = JournalistBot('OpenAI')
author = AuthorBot('OpenAI', vectorstore)

# Provide instruction
journalist.instruct(topic='physics-informed machine learning', abstract=paper_summary)
author.instruct('physics-informed machine learning')

# Start conversation
for i in range(4):
    if i == 0:
        question = journalist.step('Start the conversation')
    else:
        question = journalist.step(answer)
    print("👨🏫  Journalist: " + question)

    answer, source = author.step(question)
    print("👩🎓  Author: " + answer)

The generated conversation script is shown below. Note that to save space, some of the author bot’s answers are not shown in full:

The interview between the developed journalist bot and the author bot. (Image by author)

Since the author bot only passively answers questions (i.e., a conventional Q&A agent), we focus our attention on the behavior of the journalist bot to assess if it can properly steer the interview. Here we can see that the journalist bot started with a general question about the paper (the motivation), then adapted its questions to dig deeper into the methodology of the proposed strategy. Overall, the behavior of the developed journalist bot aligns with our expectations and it is capable of conducting the interview toward distilling the key points from the given paper. Not bad😃


5. Feature 3: User Interaction

In this section, we wrap our previous experiment into a proper user interface. Toward that end, we will address three user stories to incrementally build the desired features.

5.1 Creating the chat environment (in Jupyter Notebook)

Let’s start with the 3rd user story:

"As a user, I want an intuitive chat interface where I can watch the chatbots’ conversation unfold in real-time." (Generated by GPT-4)

To keep things simple, we opt for Jupyter widgets as they allow quickly building a chat environment entirely in Jupyter Notebook.

First, we set up the layout of displaying conversation:

import ipywidgets as widgets
from IPython.display import display

# Create button
bot_ask = widgets.Button(description="Journalist Bot ask")

# Chat history
chat_log = widgets.HTML(
    value='',
    placeholder='',
    description='',
)

# Attach callbacks
bot_ask.on_click(bot_ask_clicked)

# Arrange widgets layout
first_row = widgets.HBox([bot_ask])

# Display the UI
display(chat_log, widgets.VBox([first_row]))

We created a button (bot_ask) such that when the user clicks it, a callback function bot_ask_clicked will be invoked and one round of conversation between the journalist and author bot will be generated. Afterward, we used the HTML widgets to display the conversation as HTML content in the notebook.

The callback function bot_ask_clicked is defined below. Besides showing the journalist bot’s question and the author bot’s answer, we also indicate the location (i.e., page number) of the relevant source texts. This is possible because the step() method of the author bot also returns the source variable, which is a list of LangChain Document objects that contain the page content and the associated metadata.

def bot_ask_clicked(b):

    if chat_log.value == '':
        # Starting conversation 
        bot_question = journalist.step("Start the conversation")
        line_breaker = ""

    else:
        # Ongoing conversation
        bot_question = journalist.step(chat_log.value.split("<br><br>")[-1])
        line_breaker = "<br><br>"

    # Journalist question
    chat_log.value += line_breaker + "<b style='color:blue'>👨🏫  Journalist Bot:</b> " + bot_question      

    # Author bot answers
    response, source = author.step(bot_question)  

    # Author answer with source
    page_numbers = [str(src.metadata['page']+1) for src in source]
    unique_page_numbers = list(set(page_numbers))
    chat_log.value += "<br><b style='color:green'>👩🎓  Author Bot:</b> " + response + "<br>"
    chat_log.value += "(For details, please check the highlighted text on page(s): " + ', '.join(unique_page_numbers) + ")"

Putting everything together, we have the following interface:

Chat interface. (Image by author)

5.2 Implementing PDF highlighting functionality

In our current UI, we only indicated on which pages the author bot looked for the answers to the journalist bot’s question. Ideally, the user would expect the relevant texts to be highlighted in the original PDF to allow quick reference. This is the motivation for the 4th user story:

"As a user, I want to have the corresponding parts in the original research paper highlighted based on the chatbot’s discussion. This will help me to quickly locate the sources of the information discussed during the conversation." (Generated by GPT-4)

To achieve this goal, we employed the PyMuPDF library to search for relevant texts and perform text highlighting:

import fitz

def highlight_PDF(file_path, phrases, output_path):
    """Search and highlight given texts in PDF.

    Args:
    --------
    file_path: PDF file path
    phrases: a list of texts (in string)
    output_path: save and output PDF
    """

    # Open PDF
    doc = fitz.open(file_path)

    # Search the doc
    for page in doc:
        for phrase in phrases:            
            text_instances = page.search_for(phrase)

            # Highlight texts
            for inst in text_instances:
                highlight = page.add_highlight_annot(inst)

    # Output PDF
    doc.save(output_path, garbage=4)

In the code above, phrases is a list of strings, where each string represents one of the source texts used by the author bot to generate the answers. To highlight the texts, the code loops over each page of the PDF and checks whether each phrase is contained on that page. Once a phrase is found, it is highlighted in the original PDF.
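For example, the helper can be tried in isolation before wiring it into the UI (the paths and the phrase below are placeholders):

# Highlight a single phrase and save the annotated copy
highlight_PDF("../Papers/paper.pdf",
              ["physics-informed neural networks"],
              "highlighted.pdf")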

To integrate this highlighting functionality into our previously developed chat UI, we first need to update the callback function:

def create_bot_ask_callback(title):

    def bot_ask_clicked(b):

        if chat_log.value == '':
            # Starting conversation 
            bot_question = journalist.step("Start the conversation")
            line_breaker = ""

        else:
            # Ongoing conversation
            bot_question = journalist.step(chat_log.value.split("<br><br>")[-1])
            line_breaker = "<br><br>"

        chat_log.value += line_breaker + "<b style='color:blue'>👨🏫  Journalist Bot:</b> " + bot_question      

        # Author bot answers
        response, source = author.step(bot_question)  

        ##### NEW: Highlight relevant text in PDF
        phrases = [src.page_content for src in source]
        paper_path = "../Papers/"+title+".pdf"
        highlight_PDF(paper_path, phrases, 'highlighted.pdf')
        ##### NEW

        page_numbers = [str(src.metadata['page']+1) for src in source]
        unique_page_numbers = list(set(page_numbers))
        chat_log.value += "<br><b style='color:green'>👩🎓  Author Bot:</b> " + response + "<br>"
        chat_log.value += "(For details, please check the highlighted text on page(s): " + ', '.join(unique_page_numbers) + ")"

    return bot_ask_clicked

Although the appearance of our UI stays the same, under the hood we now obtain a new PDF file with the relevant texts (on pages 1 and 10) properly highlighted.

5.3 Allowing user input for questions

Up until now, the conversation between the two bots has been fully autonomous. Ideally, users should also be able to ask their own questions if they see fit. This is exactly what we want to address in the 5th user story:

"As a user, I want to be able to intervene and ask my own questions in the midst of the chatbot’s conversation, this way I can direct the conversation and extract the information I need from the paper." (Generated by GPT-4)

To achieve that goal, we can add another button such that the user can decide if a new round of exchange should be initiated by the journalist bot or the user:

# Create "user ask" button
user_ask = widgets.Button(description="User ask")

# Define callback
def create_user_ask_callback(title):

    def user_ask_clicked(b):

        chat_log.value += "<br><br><b style='color:purple'>🙋♂You:</b> " + user_input.value

        # Author bot answers
        response, source = author.step(user_input.value)

        # Highlight relevant text in PDF
        phrases = [src.page_content for src in source]
        paper_path = "../Papers/"+title+".pdf"
        highlight_PDF(paper_path, phrases, 'highlighted.pdf')

        page_numbers = [str(src.metadata['page']+1) for src in source]
        unique_page_numbers = list(set(page_numbers))
        chat_log.value += "<br><b style='color:green'>👩🎓  Author Bot:</b> " + response + "<br>"
        chat_log.value += "(For details, please check the highlighted text on page(s): " + ', '.join(unique_page_numbers) + ")"

        # Inform journalist bot about the asked questions 
        journalist.memory.chat_memory.add_user_message(user_input.value)

        # Clear user input
        user_input.value = ""

    return user_ask_clicked

The above callback function is essentially the same as the callback function for defining journalist-author interaction. The only difference is that the "question" will be directly input by the user. Also, to make the interview logic consistent, we appended the user question to the journalist bot’s memory, as if the user-supplied question was raised by the journalist bot.

We updated the main UI logic accordingly:

# Chat history
chat_log = widgets.HTML(
    value='',
    placeholder='',
    description='',
)

# User input question
user_input = widgets.Text(
    value='',
    placeholder='Question',
    description='',
    disabled=False,
    layout=widgets.Layout(width="60%")
)

# Attach callbacks
bot_ask.on_click(create_bot_ask_callback(paper))
user_ask.on_click(create_user_ask_callback(paper))

# Arrange the widgets
first_row = widgets.HBox([bot_ask])
second_row = widgets.HBox([user_ask, user_input])

# Display the UI
display(chat_log, widgets.VBox([first_row, second_row]))

And this is what we got, where users can input their own questions and get answered by the author bot:

Besides letting the journalist bot ask questions, users also have the opportunity to ask their own questions. (Image by author)
Besides letting the journalist bot ask questions, users also have the opportunity to ask their own questions. (Image by author)

5.4 Allowing downloading the generated script

So far so good! As the last feature to implement, we want to be able to save the conversation history to our disk for later reference. This is the goal of the 6th user story:

"As a user, I want to be able to download a transcript of the chatbot conversation. This will allow me to review the key points offline or share the information with my colleagues." (Generated by GPT-4)

Toward that end, we added another button for downloading the script and attached a callback function to it. In this callback, we used PDFDocument to convert the conversation script into a PDF file:

import re
from pdfdocument.document import PDFDocument

download = widgets.Button(description="Download paper summary",
                         layout=widgets.Layout(width='auto'))

def create_download_callback(title):

    def download_clicked(b):
        pdf = PDFDocument('paper_summary.pdf')
        pdf.init_report()

        # Remove HTML tags
        chat_history = re.sub('<.*?>', '', chat_log.value)  

        # Remove emojis
        chat_history = chat_history.replace('👨🏫 ', '')
        chat_history = chat_history.replace('👩🎓 ', '')
        chat_history = chat_history.replace('🙋♂', '')

        # Add line breaks
        chat_history = chat_history.replace('Journalist Bot:', '\n\n\nJournalist: ')
        chat_history = chat_history.replace('Author Bot:', '\n\nAuthor: ')
        chat_history = chat_history.replace('You:', '\n\n\nYou: ')

        pdf.h2("Paper Summary: " + title)
        pdf.p(chat_history)
        pdf.generate()

        # Download PDF
        print('PDF generated successfully in the local folder!')

    return download_clicked

We updated the main UI logic accordingly:

# Chat history
chat_log = widgets.HTML(
    value='',
    placeholder='',
    description='',
)

# User input question
user_input = widgets.Text(
    value='',
    placeholder='Question',
    description='',
    disabled=False,
    layout=widgets.Layout(width="60%")
)

# Attach callbacks
bot_ask.on_click(create_bot_ask_callback(paper))
user_ask.on_click(create_user_ask_callback(paper))
download.on_click(create_download_callback(paper))

# Arrange the widgets
first_row = widgets.HBox([bot_ask])
second_row = widgets.HBox([user_ask, user_input])
third_row = widgets.HBox([download])

# Display the UI
display(chat_log, widgets.VBox([first_row, second_row, third_row]))

Now, we have a download button appearing in the UI. When the user clicks it, a paper summary PDF file will be automatically generated and downloaded to the local folder:

Users now have the option to download the script of the generated conversation. (Image by author)

6. Sprint Review: show the demo!

It’s time to put up a demo to showcase our hard work 💪

In this demo, we showed the full functionality of our developed dual-chatbot system:

  • The two bots can autonomously engage in an interview with the goal of digesting the main points from the paper.
  • The user can jump into the conversation as well and ask interested questions.
  • Relevant texts for the generated answers are automatically highlighted in the original PDF.
  • The conversation history can be downloaded to the local folder.

We have successfully addressed all the user stories, good work 🎉 Now Sprint review is over, time for some retrospectives.


7. Sprint Retrospective

In this project, we focused on solving the problem of efficiently digesting complex research papers. Toward that end, we developed a dual-chatbot system where one bot plays the "journalist" while the other bot plays the "author", and two bots are engaged in an interview. In doing so, the journalist bot can act on behalf of the user and query the key points of the paper. This is beneficial as it eliminates the need for users to devise their own questions – an activity that can be challenging and time-consuming, particularly when dealing with unfamiliar subjects.

The success of the devised dual-chatbot approach relies critically on the journalist bot’s ability to steer the interview and generate insightful and relevant questions. In the current implementation, we used GPT-3.5-Turbo as the backbone LLM. To further enhance the user experience, it may be necessary to employ GPT-4 to boost the journalist bot’s reasoning capability.
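Since the backbone model is configured in a single place (the Chatbot base class), trying GPT-4 would be a one-line change, sketched below:

# Hypothetical swap inside Chatbot.__init__: use GPT-4 as the backbone
self.llm = ChatOpenAI(model_name="gpt-4", temperature=0.8)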

What’s also important is that the journalist bot needs to be capable of interpreting and understanding the technical terms and concepts used in the broader research field to which the paper belongs. Besides using advanced LLM, fine-tuning the existing LLM on research papers of the target domain could be a promising strategy to pursue.

Looking ahead, there are several possibilities to extend our current project:

  • Better UI design. For simplicity, we have used Jupyter Notebook to showcase the main idea of the dual-chatbot system. We could certainly use more sophisticated libraries (e.g., Streamlit) to build a more user-friendly, engaging UI.
  • Multimodal capability. For example, text-to-speech (TTS) techniques can be used to create audio over the generated script. This could be beneficial to users as they can keep consuming the content during a commute, exercising, or other activities where reading isn’t convenient.
  • Accessing external databases. It would be great if the dual-chatbot system could have access to larger external repositories of research papers, such that the author bot could offer comparison analysis with respect to the latest developments in the fields of interest, thereby synthesizing insights across multiple papers.
  • Generating literature review. Since the generated interview scripts can serve as condensed yet richer (than paper abstracts) versions of the full papers, we could first accumulate the scripts for a variety of papers in a specific research field, and then request a separate LLM to generate comprehensive reviews of that field, based on analyzing the accumulated interview scripts. This feature would be especially valuable for researchers when they are initiating a new research project or a literature review paper.

What a fruitful Sprint we had! If you find my content useful, you could buy me a coffee here 🤗 Thank you very much for your support! As always, you can find the companion notebook with full code here 💻 Looking forward to sharing with you more exciting LLM projects. Stay tuned!
