Learnings from a Machine Learning Engineer — Part 6: The Human Side
Practical advice for the humans involved with machine learning

In my previous articles, I have spent a lot of time talking about the technical aspects of an Image Classification problem, from data collection, model evaluation, and performance optimization to a detailed look at model training.

These elements require a certain degree of in-depth expertise, and they (usually) have well-defined metrics and established processes that are within our control.

Now it’s time to consider…

The human aspects of machine learning

Yes, this may seem like an oxymoron! But it is the interaction with people — the ones you work with and the ones who use your application — that helps bring the technology to life and provides a sense of fulfillment to your work.

These human interactions include:

  • Communicating technical concepts to a non-technical audience.
  • Understanding how your end-users engage with your application.
  • Providing clear expectations on what the model can and cannot do.

I also want to touch on the impact to people’s jobs, both positive and negative, as AI becomes a part of our everyday lives.

Overview

As in my previous articles, I will gear this discussion around an image classification application. With that in mind, these are the groups of people involved with your project:

  • AI/ML Engineer (that’s you) — bringing life to the Machine Learning application.
  • MLOps team — your peers who will deploy, monitor, and enhance your application.
  • Subject matter experts — the ones who will provide the care and feeding of labeled data.
  • Stakeholders — the ones who are looking for a solution to a real world problem.
  • End-users — the ones who will be using your application. These could be internal and external customers.
  • Marketing — the ones who will be promoting usage of your application.
  • Leadership — the ones who are paying the bill and need to see business value.

Let’s dive right in…

AI/ML Engineer

You may be a part of a team or a lone wolf. You may be an individual contributor or a team leader.

Photo by Christina @ wocintechchat.com on Unsplash

Whatever your role, it is important to see the whole picture — not only the coding, the data science, and the technology behind AI/ML — but the value that it brings to your organization.

Understand the business needs

Your company faces many challenges in reducing expenses, improving customer satisfaction, and remaining profitable. Position yourself as someone who can create an application that helps achieve those goals.

  • What are the pain points in a business process?
  • What is the value of using your application (time savings, cost savings)?
  • What are the risks of a poor implementation?
  • What is the roadmap for future enhancements and use-cases?
  • What other areas of the business could benefit from the application, and what design choices will help future-proof your work?

Communication

Deep technical discussions with your peers are probably your comfort zone. However, to be a more successful AI/ML Engineer, you should be able to clearly explain the work you are doing to different audiences.

With practice, you can explain these topics in ways that your non-technical business users can follow along with, and understand how your technology will benefit them.

To help you get comfortable with this, try creating a PowerPoint with 2–3 slides that you can cover in 5–10 minutes. For example, explain how a neural network can take an image of a cat or a dog and determine which one it is.

Practice giving this presentation in your mind, to a friend — even your pet dog or cat! This will get you more comfortable with the transitions, tighten up the content, and ensure you cover all the important points as clearly as possible.

  • Be sure to include visuals — pure text is boring, graphics are memorable.
  • Keep an eye on time — respect your audience’s busy schedule and stick to the 5–10 minutes you are given.
  • Put yourself in their shoes — your audience is interested in how the technology will benefit them, not on how smart you are.

Creating a technical presentation is a lot like the Feynman Technique — explaining a complex subject to your audience by breaking it into easily digestible pieces, with the added benefit of helping you understand it more completely yourself.

MLOps team

These are the people that deploy your application, manage data pipelines, and monitor infrastructure that keeps things running.

Without them, your model lives in a Jupyter notebook and helps nobody!

Photo by airfocus on Unsplash

These are your technical peers, so you should be able to connect with their skillset more naturally. You speak in jargon that sounds like a foreign language to most people. Even so, it is extremely helpful for you to create documentation to set expectations around:

  • Process and data flows.
  • Data quality standards.
  • Service level agreements for model performance and availability.
  • Infrastructure requirements for compute and storage.
  • Roles and responsibilities.

It is easy to have a more informal relationship with your MLOps team, but remember that everyone is trying to juggle many projects at the same time.

Email and chat messages are fine for quick-hit issues. But for larger tasks, you will want a system to track things like user stories, enhancement requests, and break-fix issues. This way you can prioritize the work and ensure you don’t forget something. Plus, you can show progress to your supervisor.

Some great tools exist, such as:

  • Jira, GitHub, Azure DevOps Boards, Asana, Monday, etc.

We are all professionals, so having a more formal system to avoid miscommunication and mistrust is good business.

Subject matter experts

These are the team members that have the most experience working with the data that you will be using in your AI/ML project.

Photo by National Cancer Institute on Unsplash

SMEs are very skilled at dealing with messy data — they are human, after all! They can handle one-off situations by considering knowledge outside of their area of expertise. For example, a doctor may recognize metal inserts in a patient’s X-ray that indicate prior surgery. They may also notice a faulty X-ray image due to equipment malfunction or technician error.

However, your machine learning model only knows what it knows, which comes from the data it was trained on. So, those one-off cases may not be appropriate for the model you are training. Your SMEs need to understand that clear, high quality training material is what you are looking for.

Think like a computer

In the case of an image classification application, the output from the model communicates to you how well it was trained on the data set. This comes in the form of error rates, which is very much like when a student takes an exam and you can tell how well they studied by seeing how many questions — and which ones — they get wrong.

In order to reduce error rates, your image data set needs to be objectively “good” training material. To do this, put yourself in an analytical mindset and ask yourself:

  • What images will the computer get the most useful information out of? Make sure all the relevant features are visible.
  • What is it about an image that confused the model? When it makes an error, try to understand why — objectively — by looking at the entire picture.
  • Is this image a “one-off” or a typical example of what the end-users will send? Consider creating a new subclass of exceptions to the norm.

Be sure to communicate to your SMEs that model performance is directly tied to data quality and give them clear guidance:

  • Provide visual examples of what works.
  • Provide counter-examples of what does not work.
  • Ask for a wide variety of data points. In the X-ray example, be sure to get patients with different ages, genders, and races.
  • Provide options to create subclasses of your data for further refinement. Use that X-ray from a patient with prior surgery as a subclass, and eventually as you can get more examples over time, the model can handle them.

This also means that you should become familiar with the data they are working with — perhaps not expert level, but certainly above a novice level.

Lastly, when working with SMEs, be cognizant of the impression they may have that the work you are doing is somehow going to replace their job. It can feel threatening when someone asks you how to do your job, so be mindful.

Ideally, you are building a tool with honest intentions and it will enable your SMEs to augment their day-to-day work. If they can use the tool as a second opinion to validate their conclusions in less time, or perhaps even avoid mistakes, then this is a win for everyone. Ultimately, the goal is to allow them to focus on more challenging situations and achieve better outcomes.

I have more to say on this in my closing remarks.

Stakeholders

These are the people you will have the closest relationship with.

Stakeholders are the ones who created the business case to have you build the machine learning model in the first place.

Photo by Ninthgrid on Unsplash

They have a vested interest in having a model that performs well. Here are some key points when working with your stakeholders:

  • Be sure to listen to their needs and requirements.
  • Anticipate their questions and be prepared to respond.
  • Be on the lookout for opportunities to improve your model performance. Your stakeholders may not be as close to the technical details as you are and may not think there is any room for improvement.
  • Bring issues and problems to their attention. They may not want to hear bad news, but they will appreciate honesty over evasion.
  • Schedule regular updates with usage and performance reports.
  • Explain technical details in terms that are easy to understand.
  • Set expectations on regular training and deployment cycles and timelines.

Your role as an AI/ML Engineer is to bring to life the vision of your stakeholders. Your application is making their lives easier, which justifies and validates the work you are doing. It’s a two-way street, so be sure to share the road.

End-users

These are the people who are using your application. They may also be your harshest critics, but you may never even hear their feedback.

Photo by Alina Ruf on Unsplash

Think like a human

Recall above when I suggested to “think like a computer” when analyzing the data for your training set. Now it’s time to put yourself in the shoes of a non-technical user of your application.

End-users of an image classification model communicate how well they understand what's expected of them through the images they submit. Poor images are like the students that didn't study for the exam, or worse, didn't read the questions, so their answers don't make sense.

Your model may be really good, but if end-users misuse the application or are not satisfied with the output, you should be asking:

  • Are the instructions confusing or misleading? Did the user focus the camera on the subject being classified, or is it more of a wide-angle image? You can’t blame the user if they follow bad instructions.
  • What are their expectations? When the results are presented to the user, are they satisfied or are they frustrated? You may notice repeated images from frustrated users.
  • Are the usage patterns changing? Are they trying to use the application in unexpected ways? This may be an opportunity to improve the model.

Inform your stakeholders of your observations. There may be simple fixes to improve end-user satisfaction, or there may be more complex work ahead.

If you are lucky, you may discover an unexpected way to leverage the application that leads to expanded usage or exciting benefits to your business.

Explainability

Most AI/ML models are considered “black boxes” that perform millions of calculations on extremely high-dimensional data and produce a rather simplistic result without any reasoning behind it.

The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.
— The Hitchhiker’s Guide to the Galaxy

Depending on the situation, your end-users may require more explanation of the results, such as with medical imaging. Where possible, you should consider incorporating model explainability techniques such as LIME, SHAP, and others. These responses can help put a human touch to cold calculations.
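
To make this concrete, here is a minimal sketch of a LIME-based explanation for an image classifier. It assumes a hypothetical classify_fn wrapper that takes a batch of images and returns class probabilities; the function name and parameter values are illustrative, not part of any specific project.

import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def explain_prediction(image, classify_fn):
    """Highlight the image regions that most influenced the model's top prediction.

    image: HxWx3 array with values in 0-255; classify_fn: hypothetical model wrapper.
    """
    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        image.astype(np.double),
        classify_fn,        # hypothetical wrapper around your trained classifier
        top_labels=2,
        num_samples=500     # number of perturbed samples; more is slower but smoother
    )
    top_label = explanation.top_labels[0]
    overlay, mask = explanation.get_image_and_mask(
        top_label, positive_only=True, num_features=5, hide_rest=False
    )
    return mark_boundaries(overlay / 255.0, mask)

Showing the highlighted regions next to the predicted label gives end-users, and especially SMEs, a way to sanity-check what the model is actually looking at.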

Now it’s time to switch gears and consider higher-ups in your organization.

Marketing team

These are the people who promote the use of your hard work. If your end-users are completely unaware of your application, or don’t know where to find it, your efforts will go to waste.

The marketing team controls where users can find your app on your website and link to it through social media channels. They also see the technology through a different lens.

Gartner hype cycle. Image from Wikipedia – https://en.wikipedia.org/wiki/Gartner_hype_cycle

The above hype cycle is a good representation of how technical advancements tend to flow. At the beginning, there can be an unrealistic expectation of what your new AI/ML tool can do — it’s the greatest thing since sliced bread!

Then the “new” wears off and excitement wanes. You may face a lack of interest in your application as the marketing team (as well as your end-users) moves on to the next thing. In reality, the value of your efforts is somewhere in the middle.

Understand that the marketing team’s interest is in promoting the use of the tool because of how it will benefit the organization. They may not need to know the technical inner workings. But they should understand what the tool can do, and be aware of what it cannot do.

Honest and clear communication up-front will help smooth out the hype cycle and keep everyone interested longer. This way the crash from peak expectations to the trough of disillusionment is not so severe that the application is abandoned altogether.

Leadership team

These are the people that authorize spending and have the vision for how the application fits into the overall company strategy. They are driven by factors that you have no control over and you may not even be aware of. Be sure to provide them with the key information about your project so they can make informed decisions.

Photo by Adeolu Eletu on Unsplash

Depending on your role, you may or may not have direct interaction with executive leadership in your company. Your job is to summarize the costs and benefits associated with your project, even if that is just with your immediate supervisor who will pass this along.

Your costs will likely include:

  • Compute and storage — training and serving a model.
  • Image data collection — both real-world and synthetic or staged.
  • Hours per week — SME, MLOps, AI/ML engineering time.

Highlight the savings and/or value added:

  • Provide measures on speed and accuracy.
  • Translate efficiencies into FTE hours saved and customer satisfaction.
  • Bonus points if you can find a way to produce revenue.

Business leaders, much like the marketing team, may follow the hype cycle:

  • Be realistic about model performance. Don’t try to oversell it, but be honest about the opportunities for improvement.
  • Consider creating a human benchmark test to measure accuracy and speed for an SME. It is easy to say human accuracy is 95%, but it’s another thing to measure it (see the sketch after this list).
  • Highlight short-term wins and how they can become long-term success.
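
Here is the minimal sketch mentioned above for a human benchmark test, assuming you have a small labeled sample, the SME's answers, and the seconds they spent per image; all names and numbers are made up for illustration.

# Hypothetical benchmark data collected during an SME labeling session
ground_truth = ["cat", "dog", "dog", "cat", "dog"]
sme_answers  = ["cat", "dog", "cat", "cat", "dog"]
sme_seconds  = [12.0, 9.5, 15.2, 8.8, 11.1]

correct = sum(truth == answer for truth, answer in zip(ground_truth, sme_answers))
accuracy = correct / len(ground_truth)
avg_seconds = sum(sme_seconds) / len(sme_seconds)

print(f"SME accuracy: {accuracy:.1%} at {avg_seconds:.1f} seconds per image")
# Compare these numbers against the model's accuracy and per-image latency
# to give leadership a measured, not assumed, human baseline.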

Conclusion

I hope you can see that, beyond the technical challenges of creating an AI/ML application, there are many humans involved in a successful project. Being able to interact with these individuals, and meet them where they are in terms of their expectations from the technology, is vital to advancing the adoption of your application.

Photo by Vlad Hilitanu on Unsplash

Key takeaways:

  • Understand how your application fits into the business needs.
  • Practice communicating to a non-technical audience.
  • Collect measures of model performance and report these regularly to your stakeholders.
  • Expect that the hype cycle could help and hurt your cause, and that setting consistent and realistic expectations will ensure steady adoption.
  • Be aware that factors outside of your control, such as budgets and business strategy, could affect your project.

And most importantly…

Don’t let machines have all the fun learning!

Human nature gives us the curiosity we need to understand our world. Take every opportunity to grow and expand your skills, and remember that human interaction is at the heart of machine learning.

Closing remarks

Advancements in AI/ML have the potential (assuming they are properly developed) to do many tasks as well as humans. It would be a stretch to say “better than” humans because it can only be as good as the training data that humans provide. However, it is safe to say AI/ML can be faster than humans.

The next logical question would be, “Well, does that mean we can replace human workers?”

This is a delicate topic, and I want to be clear that I am not an advocate of eliminating jobs.

I see my role as an AI/ML Engineer as one of creating tools that aid in someone else’s job or enhance their ability to complete their work successfully. When used properly, the tools can validate difficult decisions and speed through repetitive tasks, allowing your experts to spend more time on the one-off situations that require more attention.

There may also be new career opportunities, from the care and feeding of data, to quality assessment and user experience, and even to new roles that leverage the technology in exciting and unexpected ways.

Unfortunately, business leaders may make decisions that impact people’s jobs, and this is completely out of your control. But all is not lost — even for us AI/ML Engineers…

There are things we can do

  • Be kind to the fellow human beings that we call “coworkers”.
  • Be aware of the fear and uncertainty that comes with technological advancements.
  • Be on the lookout for ways to help people leverage AI/ML in their careers and to make their lives better.

This is all part of being human.

The Invisible Revolution: How Vectors Are (Re)defining Business Success
The hidden force behind AI is powering the next wave of business transformation

In a world that runs increasingly on data, business leaders must understand vector thinking. At first, vectors may appear as complicated as algebra felt in school, but they are a fundamental building block. Just as algebra is essential for everyday tasks like splitting a bill or computing interest, vectors are essential to the digital systems behind decision making, customer engagement, and data protection.

They represent a radically different concept of relationships and patterns. They do not simply divide data into rigid categories. Instead, they offer a dynamic, multidimensional view of the underlying connections. For example, “similar” for two customers may mean more than shared demographics or purchase histories; it’s their behaviors, preferences, and habits that distinctly align. Such associations can be defined and measured accurately in a vector space. But for many modern businesses, the logic feels too complex, so leaders tend to fall back on old, learned, rule-based patterns instead. Back then, fraud detection, for example, still relied on simple rules about transaction limits. We have since evolved to recognize patterns and anomalies.

While it might have been common to block transactions that allocate 50% of your credit card limit at once just a few years ago, we are now able to analyze your retailer-specific spend history, look at average baskets of other customers at the very same retailers, and do some slight logic checks such as the physical location of your previous spends.

So a $7,000 transaction at McDonald’s in Dubai might just not happen if you just spent $3 on a bike rental in Amsterdam. Even a $20 transaction might not go through, since vector-based logic can rule out the physical distance as implausible. Instead, the $7,000 transaction for your new E-Bike at a retailer near Amsterdam’s city center may just work flawlessly. Welcome to the insight of living in a world managed by vectors.

The danger of ignoring the paradigm of vectors is huge. Not mastering algebra can lead to bad financial decisions. Similarly, not knowing vectors can leave you vulnerable as a business leader. While the average customer may stay as unaware of vectors as the average passenger is of aerodynamics, a business leader should at least know what kerosene costs and how many seats need to be filled for a specific flight to break even. You may not need to fully understand the systems you rely on, but a basic understanding helps you know when to reach out to the experts. And this is exactly my aim in this little journey into the world of vectors: become aware of the basic principles and know when to ask for more, so you can better steer and manage your business.

In the hushed hallways of research labs and tech companies, a revolution was brewing. It would change how computers understood the world. This revolution had nothing to do with processing power or storage capacity. It was all about teaching machines to understand context, meaning, and nuance in words, using mathematical representations called vectors. Before we can appreciate the magnitude of this shift, we first need to understand what it differs from.

Think about the way humans take in information. When we look at a cat, we don’t just process a checklist of components: whiskers, fur, four legs. Instead, our brains work through a network of relationships, contexts, and associations. We know a cat is more like a lion than a bicycle, not because we memorized that fact, but because our brains have naturally learned these relationships. Vector representations let computers consume content in a similarly human-like way. And we ought to understand how and why this is true. It’s as fundamental as knowing algebra in the time of an impending AI revolution.

In this brief jaunt into the vector realm, I will explain how vector-based computing works and why it’s so transformative. The code examples are for illustration only and have no stand-alone functionality. You don’t have to be an engineer to understand these concepts. All you have to do is follow along, as I walk you through examples with plain-language commentary explaining each one step by step. I don’t aim to be a world-class mathematician. I want to make vectors understandable to everyone: business leaders, managers, engineers, musicians, and others.


What are vectors, anyway?

Photo by Pete F on Unsplash

It is not that the vector-based computing journey started recently. Its roots go back to the 1950s with the development of distributed representations in cognitive science. James McClelland and David Rumelhart, among other researchers, theorized that the brain holds concepts not as individual entities, but as distributed patterns of activity across networks of neurons. This discovery paved the way for contemporary vector representations.

The real breakthrough was three things coming together:
The exponential growth in computational power, the development of sophisticated neural network architectures, and the availability of massive datasets for training.

It is the combination of these elements that makes vector-based systems theoretically possible and practically implementable at scale. Mainstream AI as people have come to know it (with the likes of ChatGPT and its peers) is a direct consequence of this.

To better understand, let me put this in context: conventional computing systems work on symbols — discrete, human-readable tokens and rules. A traditional system, for instance, might represent a customer as a record:

customer = {
    'id': '12345',
    'age': 34,
    'purchase_history': ['electronics', 'books'],
    'risk_level': 'low'
}

This representation may be readable and logical, but it misses subtle patterns and relationships. In contrast, vector representations encode information within a high-dimensional space where relationships arise naturally through geometric proximity. That same customer might be represented as a 384-dimensional vector, where each of these dimensions contributes to a rich, nuanced profile. A few lines of code are enough to transform flat, tabular customer data into vectors. Let’s take a look at how simple this is:

from sentence_transformers import SentenceTransformer
import numpy as np

class CustomerVectorization:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        
    def create_customer_vector(self, customer_data):
        """
        Transform customer data into a rich vector representation
        that captures subtle patterns and relationships
        """
        # Combine various customer attributes into a meaningful text representation
        customer_text = f"""
        Customer profile: {customer_data['age']} year old,
        interested in {', '.join(customer_data['purchase_history'])},
        risk level: {customer_data['risk_level']}
        """
        
        # Generate base vector from text description
        base_vector = self.model.encode(customer_text)
        
        # Enrich vector with numerical features
        numerical_features = np.array([
            customer_data['age'] / 100,  # Normalized age
            len(customer_data['purchase_history']) / 10,  # Purchase history length
            self._risk_level_to_numeric(customer_data['risk_level'])
        ])
        
        # Combine text-based and numerical features
        combined_vector = np.concatenate([
            base_vector,
            numerical_features
        ])
        
        return combined_vector
    
    def _risk_level_to_numeric(self, risk_level):
        """Convert categorical risk level to normalized numeric value"""
        risk_mapping = {'low': 0.1, 'medium': 0.5, 'high': 0.9}
        return risk_mapping.get(risk_level.lower(), 0.5)

I trust that this code example has helped demonstrate how easily complex customer data can be encoded into meaningful vectors. The method seems complex at first, but it is simple: we merge text and numerical data on customers, which gives us rich, information-dense vectors that capture each customer’s essence. What I love most about this technique is its simplicity and flexibility. Just as we encoded age, purchase history, and risk level here, you can replicate this pattern to capture any other customer attributes relevant to your use case. Recall the credit card spending patterns we described earlier: it’s similar data being turned into vectors, giving it a meaning far greater than it could ever have had it stayed in flat tables driving traditional rule-based logic.

What our little code example does is combine two complementary representations, one in a semantically rich embedding space and one in a normalized value space, mapping every record to a point that can be compared directly with others.

This allows the systems to identify complex patterns and relations that traditional data structures won’t be able to reflect adequately. With the geometric nature of vector spaces, the shape of these structures tells the stories of similarities, differences, and relationships, allowing for an inherently standardized yet flexible representation of complex data. 

Going forward, you will see this structure repeated across other applications of vector-based customer analysis: take the relevant data, aggregate it into a format we can work with, and let a combined representation merge heterogeneous data into a common vector understanding. Whether it’s recommendation systems, customer segmentation models, or predictive analytics tools, this fundamental approach to thoughtful vectorization underpins all of them. That is why it is worth knowing and understanding, even if you consider yourself non-technical and more on the business side.

Just keep in mind — the key is considering what part of your data carries meaningful signals and how to encode them in a way that preserves their relationships. It is nothing more than following your business logic in a different way of thinking than algebra: a more modern, multi-dimensional way.
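
As a small illustration of where this leads, here is a sketch of how two customer vectors produced by the CustomerVectorization class above could be compared. The sample records and the idea of a similarity threshold are assumptions for illustration, not recommendations.

import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Compare two customer vectors by direction rather than magnitude."""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

vectorizer = CustomerVectorization()  # the class defined in the earlier snippet

customer_a = {'age': 34, 'purchase_history': ['electronics', 'books'], 'risk_level': 'low'}
customer_b = {'age': 36, 'purchase_history': ['electronics', 'games'], 'risk_level': 'low'}

similarity = cosine_similarity(
    vectorizer.create_customer_vector(customer_a),
    vectorizer.create_customer_vector(customer_b),
)
print(f"Customer similarity: {similarity:.2f}")  # e.g. treat values above ~0.8 as "similar"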


The Mathematics of Meaning (Kings and Queens)

Photo by Debbie Fan on Unsplash

All human communication carries rich networks of meaning that our brains are wired to make sense of automatically. These are meanings we can capture mathematically using vector-based computing: we can represent words as points in a multi-dimensional word space. This geometric treatment allows us to think in spatial terms about the abstract semantic relations we are interested in, as distances and directions.

For instance, the relationship “King is to Queen as Man is to Woman” is encoded in a vector space in such a way that the direction and distance between the words “King” and “Queen” are similar to those between the words “Man” and “Woman.”

Let’s take a step back to understand why this might be: the key component that makes this system work is word embeddings — numerical representations that encode words as vectors in a dense vector space. These embeddings are derived from examining co-occurrences of words across large snippets of text. Just as we learn that “dog” and “puppy” are related concepts by observing that they occur in similar contexts, embedding algorithms learn to embed these words close to each other in a vector space.

Word embeddings reveal their real power when we look at how they encode analogical relationships. Think about what we know about the relationship between “king” and “queen.” We can tell through intuition that these words are different in gender but share associations related to the palace, authority, and leadership. Through a wonderful property of vector space systems — vector arithmetic — this relationship can be captured mathematically.

One does this beautifully in the classic example:

vector('king') - vector('man') + vector('woman') ≈ vector('queen')

This equation tells us that if we have the vector for “king,” and we subtract out the “man” vector (we remove the concept of “male”), and then we add the “woman” vector (we add the concept of “female”), we get a new point in space very close to that of “queen.” That’s not some mathematical coincidence — it’s based on how the embedding space has arranged the meaning in a sort of structured way.

We can apply this idea of context in Python with pre-trained word embeddings:

import gensim.downloader as api

# Load a pre-trained model that contains word vectors learned from Google News
model = api.load('word2vec-google-news-300')

# Define our analogy words
source_pair = ('king', 'man')
target_word = 'woman'

# Find which word completes the analogy using vector arithmetic
result = model.most_similar(
    positive=[target_word, source_pair[0]], 
    negative=[source_pair[1]], 
    topn=1
)

# Display the result
print(f"{source_pair[0]} is to {source_pair[1]} as {result[0][0]} is to {target_word}")

The structure of this vector space exposes many basic principles:

  1. Semantic similarity is present as spatial proximity. Related words congregate: the neighborhoods of ideas. “Dog,” “puppy,” and “canine” would be one such cluster; meanwhile, “cat,” “kitten,” and “feline” would create another cluster nearby.
  2. Relationships between words become directions in the space. The vector from “man” to “woman” encodes a gender relationship, and other such relationships (for example, “king” to “queen” or “actor” to “actress”) typically point in the same direction.
  3. The magnitude of vectors can carry meaning about word importance or specificity. Common words often have shorter vectors than specialized terms, reflecting their broader, less specific meanings.

Working with relationships between words in this way gave us a geometric encoding of meaning and the mathematical precision needed to reflect the nuances of natural language processing to machines. Instead of treating words as separate symbols, vector-like systems can recognize patterns, make analogies, and even uncover relationships that were never programmed.

To better grasp what was just discussed, I took the liberty of mapping the words we mentioned before (“King, Man, Woman”; “Dog, Puppy, Canine”; “Cat, Kitten, Feline”) to corresponding 2D vectors. These vectors numerically represent semantic meaning.

Visualization of the before-mentioned example terms as 2D word embeddings. Showing grouped categories for explanatory purposes. Data is fabricated and axes are simplified for educational purposes.
  • Human-related words have high positive values on both dimensions.
  • Dog-related words have negative x-values and positive y-values.
  • Cat-related words have positive x-values and negative y-values.

Be aware, those values are fabricated by me for illustration. As shown in the 2D space where the vectors are plotted, you can observe groups based on the positions of the dots representing the vectors; the three dog-related words, for example, cluster into the “Dog” category, and so on.
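
To make the grouping visible in code rather than in a chart, here is a small sketch using the same kind of fabricated 2D values; the coordinates are invented for illustration, just like in the figure above.

import numpy as np

# Fabricated 2D embeddings, roughly matching the groupings described above
embeddings = {
    "king": np.array([0.9, 0.8]),  "man": np.array([0.7, 0.6]),     "woman": np.array([0.8, 0.7]),
    "dog": np.array([-0.7, 0.6]),  "puppy": np.array([-0.6, 0.7]),  "canine": np.array([-0.8, 0.5]),
    "cat": np.array([0.6, -0.7]),  "kitten": np.array([0.7, -0.6]), "feline": np.array([0.5, -0.8]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["dog"], embeddings["puppy"]))   # high: same neighborhood of ideas
print(cosine(embeddings["dog"], embeddings["kitten"]))  # negative: a different neighborhood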

Grasping these basic principles gives us insight into both the capabilities and limitations of modern language AI, such as large language models (LLMs). Though these systems can do amazing analogical and relational gymnastics, they are ultimately geometric patterns based on the ways that words appear in proximity to one another in a body of text: an elaborate but, by definition, partial reflection of human linguistic comprehension. As such, an LLM, being based on vectors, can only generate output derived from what it has received as input. That doesn’t mean it only reproduces its training data 1:1 (we all know about the fantastic hallucination capabilities of LLMs); it means that LLMs, unless specifically instructed, won’t come up with neologisms or new language to describe things. This basic understanding is still lacking in many business leaders who expect LLMs to be miracle machines while remaining unaware of the underlying principles of vectors.


A Tale of Distances, Angles, and Dinner Parties

Photo by OurWhisky Foundation on Unsplash

Now, let’s assume you’re throwing a dinner party and it’s all about Hollywood and the big movies, and you want to seat people based on what they like. You could just calculate “distance” between their preferences (genres, perhaps even hobbies?) and find out who should sit together. But deciding how you measure that distance can be the difference between compelling conversations and annoyed participants. Or awkward silences. And yes, that company party flashback is repeating itself. Sorry for that!

The same is true in the world of vectors. The distance metric defines how “similar” two vectors look, and therefore, ultimately, how well your system performs to predict an outcome.

Euclidean Distance: Straightforward, but Limited

Euclidean distance measures the straight-line distance between two points in space, making it easy to understand:

  • Euclidean distance is fine as long as vectors are physical locations.
  • However, in high-dimensional spaces (like vectors representing user behavior or preferences), this metric often falls short. Differences in scale or magnitude can skew results, focusing on scale over actual similarity.

Example: Two vectors might represent how much your dinner guests use streaming services across genres:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

While their preferences align, Euclidean distance would make them seem vastly different because of the disparity in overall activity.

But in higher-dimensional spaces, such as user behavior or textual meaning, Euclidean distance becomes increasingly less informative. It overweights magnitude, which can obscure comparisons. Consider two moviegoers: one has seen 200 action movies, the other has seen 10, but they both like the same genres. Because of their sheer activity level, the second viewer would appear much less similar to the first when using Euclidean distance, even though all they ever watch is Bruce Willis movies.

Cosine Similarity: Focused on Direction

The cosine similarity method takes a different approach. It focuses on the angle between vectors, not their magnitudes. It’s like comparing the path of two arrows. If they point the same way, they are aligned, no matter their lengths. This shows that it’s perfect for high-dimensional data, where we care about relationships, not scale.

  • If two vectors point in the same direction, they’re considered similar (cosine similarity approx of 1).
  • When opposing (so pointing in opposite directions), they differ (cosine similarity ≈ -1).
  • If they’re perpendicular (at a right angle of 90° to one another), they are unrelated (cosine similarity close to 0).

This normalizing property ensures that the similarity score correctly measures alignment, regardless of how one vector is scaled in comparison to another.

Example: Returning to our streaming preferences, let’s take a look at how our dinner guests’ preferences look as vectors:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

Let us discuss why cosine similarity is so effective in this case. When we compute cosine similarity for vec1 [5, 10, 5] and vec2 [1, 2, 1], we are essentially measuring the angle between these vectors.

To compute cosine similarity, we first normalize the vectors, dividing each component by the vector’s length. This operation “cancels” the differences in magnitude:

  • For vec1: Normalization gives us roughly [0.41, 0.82, 0.41].
  • For vec2: Normalization gives us roughly [0.41, 0.82, 0.41] as well.

And now we also understand why these vectors are considered identical with regard to cosine similarity: their normalized versions are identical!
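
A quick sketch with nothing but NumPy confirms this, using the same illustrative dinner-guest vectors from above:

import numpy as np

vec1 = np.array([5, 10, 5])
vec2 = np.array([1, 2, 1])

# Normalize each vector to unit length, removing the effect of overall activity
unit1 = vec1 / np.linalg.norm(vec1)
unit2 = vec2 / np.linalg.norm(vec2)

print(unit1.round(2))               # [0.41 0.82 0.41]
print(unit2.round(2))               # [0.41 0.82 0.41]
print(float(np.dot(unit1, unit2)))  # approximately 1.0: identical direction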

This tells us that even though dinner guest A views more total content, the proportion they allocate to any given genre perfectly mirrors dinner guest B’s preferences. It’s like saying both your guests dedicate 25% of their time to action, 50% to drama, and 25% to comedy, no matter the total hours viewed.

It’s this normalization that makes cosine similarity particularly effective for high-dimensional data such as text embeddings or user preferences.

When dealing with data of many dimensions (think hundreds or thousands of components of a vector for various features of a movie), it is often the relative significance of each dimension corresponding to the complete profile rather than the absolute values that matter most. Cosine similarity identifies precisely this arrangement of relative importance and is a powerful tool to identify meaningful relationships in complex data.


Hiking up the Euclidean Mountain Trail

Photo by Christian Mikhael on Unsplash

In this part, we will see how different approaches to measuring similarity behave in practice, with a concrete real-world example and a little code. Even if you are a non-techie, the code will be easy to follow; it’s there to illustrate the simplicity of it all. No fear!

How about we quickly discuss a 10-mile-long hiking trail? Two friends, Alex and Blake, write trail reviews of the same hike, but each ascribes it a different character:

The trail gained 2,000 feet in elevation over just 2 miles! Easily doable with some high spikes in between!
Alex

and

Beware, we hiked 100 straight feet up in the forest terrain at the spike! Overall, 10 beautiful miles of forest!
Blake

These descriptions can be represented as vectors:

alex_description = [2000, 2]  # [elevation_gain, trail_distance]
blake_description = [100, 10]  # [elevation_gain, trail_distance]

Let’s combine both similarity measures and see what it tells us:

import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Measures how similar the pattern or shape of two descriptions is,
    ignoring differences in scale. Returns 1.0 for perfectly aligned patterns.
    """
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

def euclidean_distance(vec1, vec2):
    """
    Measures the direct 'as-the-crow-flies' difference between descriptions.
    Smaller numbers mean descriptions are more similar.
    """
    return np.linalg.norm(np.array(vec1) - np.array(vec2))

# Alex focuses on the steep part: 2000ft elevation over 2 miles
alex_description = [2000, 2]  # [elevation_gain, trail_distance]

# Blake describes the whole trail: 100ft of climbing at the spike, 10 total miles
blake_description = [100, 10]  # [elevation_gain, trail_distance]

# Let's see how different these descriptions appear using each measure
print("Comparing how Alex and Blake described the same trail:")
print("\nEuclidean distance:", euclidean_distance(alex_description, blake_description))
print("(A larger number here suggests very different descriptions)")

print("\nCosine similarity:", cosine_similarity(alex_description, blake_description))
print("(A number close to 1.0 suggests similar patterns)")

# Let's also normalize the vectors to see what cosine similarity is looking at
alex_normalized = alex_description / np.linalg.norm(alex_description)
blake_normalized = blake_description / np.linalg.norm(blake_description)

print("\nAlex's normalized description:", alex_normalized)
print("Blake's normalized description:", blake_normalized)

So now, running this code, something magical happens:

Comparing how Alex and Blake described the same trail:

Euclidean distance: 1900.0168
(A larger number here suggests very different descriptions)

Cosine similarity: 0.9951
(A number close to 1.0 suggests similar patterns)

Alex's normalized description: [1.00000 0.00100]
Blake's normalized description: [0.99504 0.09950]

This output shows why, depending on what you are measuring, the same trail may appear different or similar.

The large Euclidean distance (roughly 1900) suggests these are very different descriptions. It’s understandable: 2000 is a lot different from 100, and 2 is a lot different from 10. It’s like taking the raw difference between these numbers without understanding their meaning.

But the high cosine similarity (0.995) tells us something more interesting: both descriptions capture a similar pattern.

If we look at the normalized vectors, we can see it too: both Alex and Blake are describing a trail in which elevation gain is the dominant feature. In each normalized vector, the first number (elevation gain) is much larger relative to the second (trail distance). Normalization compares proportions rather than raw volume, and both descriptions share the same trait that defines the trail.

Perfectly true to life: Alex and Blake hiked the same trail but focused on different parts of it when writing their reviews. Alex focused on the steeper section and described a 2,000-foot climb over 2 miles, while Blake described the profile of the entire trail, with its short steep spike spread across 10 miles. Cosine similarity identifies these descriptions as variations of the same basic trail pattern, whereas Euclidean distance regards them as completely different trails.

This example highlights the need to select the appropriate similarity measure. In real use cases, normalizing and taking cosine similarity surfaces meaningful correlations that raw distance measures like Euclidean distance miss.


Real-World Impacts of Metric Choices

Photo by fabio on Unsplash

The metric you pick doesn’t merely change the numbers; it influences the results of complex systems. Here’s how it breaks down in various domains:

  • In Recommendation Engines: When it comes to cosine similarity, we can group users who have the same tastes, even if they are doing different amounts of overall activity. A streaming service could use this to recommend movies that align with a user’s genre preferences, regardless of what is popular among a small subset of very active viewers.
  • In Document Retrieval: When querying a database of documents or research papers, cosine similarity ranks documents according to whether their content is similar in meaning to the user’s query, rather than their text length. This enables systems to retrieve results that are contextually relevant to the query, even though the documents are of a wide range of sizes.
  • In Fraud Detection: Patterns of behavior are often more important than pure numbers. Cosine similarity can be used to detect anomalies in spending habits, as it compares the direction of the transaction vectors — type of merchant, time of day, transaction amount, etc. — rather than the absolute magnitude.

And these differences matter because they give a sense of how systems “think”. Let’s get back to that credit card example one more time: a system might, for example, flag a high-value $7,000 transaction for your new E-Bike as suspicious using Euclidean distance — even if that transaction is normal for you, given you have an average spend of $20,000 a month.

A cosine-based system, on the other hand, understands that the transaction is consistent with what the user typically spends their money on, thus avoiding unnecessary false notifications.
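
As a rough sketch of that difference, consider comparing a customer's usual spend-by-category vector to the current month under both metrics; the categories and amounts below are fabricated for illustration only.

import numpy as np

# Fabricated monthly spend by category: [groceries, transport, sporting goods]
typical_month = np.array([600.0, 150.0, 3500.0])   # a customer who regularly buys high-end gear
current_month = np.array([1200.0, 300.0, 7000.0])  # the same pattern, simply doubled (E-Bike month)

euclidean = np.linalg.norm(typical_month - current_month)
cosine = np.dot(typical_month, current_month) / (
    np.linalg.norm(typical_month) * np.linalg.norm(current_month)
)

print(f"Euclidean distance: {euclidean:,.0f}")  # large raw gap: a magnitude-based rule might flag it
print(f"Cosine similarity: {cosine:.2f}")       # 1.00: the spending direction is unchanged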

But measures like Euclidean distance and cosine similarity are not merely theoretical. They’re the blueprints on which real-world systems stand. Whether it’s recommendation engines or fraud detection, the metrics we choose will directly impact how systems make sense of relationships in data.

Vector Representations in Practice: Industry Transformations

Photo by Louis Reed on Unsplash

This ability for abstraction is what makes vector representations so powerful — they transform complex and abstract field data into concepts that can be scored and actioned. These insights are catalyzing fundamental transformations in business processes, decision-making, and customer value delivery across sectors.

Next, let’s look at a concrete use case to see how vectors free up time to solve big problems and create new, high-impact opportunities. I picked one industry to show what vector-based approaches to a challenge can achieve: a healthcare example from a clinical setting. Why? Because it matters to us all and is easier to relate to than digging into the depths of the finance system, insurance, renewable energy, or chemistry.

Healthcare Spotlight: Pattern Recognition in Complex Medical Data

The healthcare industry poses a perfect storm of challenges that vector representations can uniquely solve. Think of the complexities of patient data: medical histories, genetic information, lifestyle factors, and treatment outcomes all interact in nuanced ways that traditional rule-based systems are incapable of capturing.

At Massachusetts General Hospital, researchers implemented a vector-based early detection system for sepsis, a condition in which every hour of early detection increases the chances of survival by 7.6% (see the full study at pmc.ncbi.nlm.nih.gov/articles/PMC6166236/).

In this new methodology, spontaneous neutrophil velocity profiles (SVP) are used to describe the movement patterns of neutrophils from a drop of blood. We won’t get too medically detailed here, because we’re vector-focused today, but a neutrophil is an immune cell that is kind of a first responder in what the body uses to fight off infections.

The system then encodes each neutrophil’s motion as a vector that captures not just its magnitude (i.e., speed), but also its direction. They converted these biological patterns into high-dimensional vector spaces, capturing subtle differences and showing that healthy individuals and sepsis patients exhibit statistically significant differences in movement. These numeric vectors were then processed by a machine learning model trained to detect early signs of sepsis. The result was a diagnostic tool with impressive sensitivity (97%) and specificity (98%) for rapid and accurate identification of this fatal condition — probably using the cosine similarity we just learned about a moment ago (the paper doesn’t go into much detail, so this is pure speculation, but it would be the most suitable).
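
The study's actual pipeline is far more sophisticated, but purely to illustrate the shape of such a system, here is a sketch that classifies fabricated motion vectors; none of the numbers, features, or model choices below come from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated neutrophil motion vectors: [average speed, directional persistence]
healthy = np.array([[2.1, 0.80], [1.9, 0.75], [2.3, 0.85], [2.0, 0.90]])
septic  = np.array([[0.6, 0.30], [0.8, 0.25], [0.5, 0.35], [0.7, 0.20]])

X = np.vstack([healthy, septic])
y = np.array([0] * len(healthy) + [1] * len(septic))  # 0 = healthy, 1 = sepsis

clf = LogisticRegression().fit(X, y)

new_sample = np.array([[0.65, 0.28]])  # a fabricated new motion profile
print(clf.predict_proba(new_sample))   # probabilities for [healthy, sepsis]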

This is just one example of how medical data can be encoded into vector representations and turned into malleable, actionable insights. This approach made it possible to re-contextualize complex relationships and, combined with machine learning, worked around the limitations of previous diagnostic modalities, proving to be a potent tool for clinicians to save lives. It’s a powerful reminder that vectors aren’t merely theoretical constructs — they’re practical, life-saving solutions that are powering the future of healthcare as much as your credit card risk detection software and, hopefully, your business as well.


Lead and understand, or face disruption. The naked truth.

Photo by Hunters Race on Unsplash

With all you have read by now, think of a decision as small as the choice of metric under which data relationships are evaluated. Leaders risk making assumptions that are subtle yet disastrous. You are basically using algebra as a tool and getting some result, but you cannot know whether it is right: making leadership decisions without understanding the fundamentals of vectors is like using a calculator without knowing what formulas you are applying.

The good news is this doesn’t mean that business leaders have to become data scientists. Vectors are delightful because, once the core ideas have been grasped, they become very easy to work with. An understanding of a handful of concepts (for example, how vectors encode relationships, why distance metrics are important, and how embedding models function) can fundamentally change how you make high-level decisions. These tools will help you ask better questions, work with technical teams more effectively, and make sound decisions about the systems that will govern your business.

The returns on this small investment in comprehension are huge. There is much talk about personalization, yet few organizations use vector-based thinking in their business strategies. It could help them leverage personalization to its full potential, delighting customers with tailored experiences and building loyalty. You could innovate in areas like fraud detection and operational efficiency, leveraging subtle patterns in data that traditional approaches miss — or perhaps even save lives, as described above. Equally important, you can avoid expensive missteps that happen when leaders defer key decisions to others without understanding what they mean.

The truth is, vectors are here now, driving the vast majority of the hyped AI technology behind the scenes and helping create the world we navigate today and tomorrow. Companies that do not adapt their leadership to think in vectors risk falling behind in a competitive landscape that becomes ever more data-driven. Those who adopt this new paradigm will not just survive but prosper in an age of never-ending AI innovation.

Now is the moment to act. Start to view the world through vectors. Study their tongue, examine their doctrine, and ask how the new could change your tactics and your lodestars. Much in the way that algebra became an essential tool for writing one’s way through practical life challenges, vectors will soon serve as the literacy of the data age. Actually they do already. It is the future of which the powerful know how to take control. The question is not if vectors will define the next era of businesses; it is whether you are prepared to lead it.

Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o
Inside Deb8flow: Real-time AI debates with LangGraph and GPT-4o

Introduction

I’ve always been fascinated by debates—the strategic framing, the sharp retorts, and the carefully timed comebacks. Debates aren’t just entertaining; they’re structured battles of ideas, driven by logic and evidence. Recently, I started wondering: could we replicate that dynamic using AI agents—having them debate each other autonomously, complete with real-time fact-checking and moderation? The result was Deb8flow, an autonomous AI debating environment powered by LangGraph, OpenAI’s GPT-4o model, and the new integrated Web Search feature.

In Deb8flow, two agents—Pro and Con—square off on a given topic while a Moderator manages turn-taking. A dedicated Fact Checker reviews every claim in real time using GPT-4o’s new browsing capabilities, and a final Judge evaluates the arguments for quality and coherence. If an agent repeatedly makes factual errors, they’re automatically disqualified—ensuring the debate stays grounded in truth.

This article offers an in-depth look at the advanced architecture and dynamic workflows that power autonomous AI debates. I’ll walk you through how Deb8flow’s modular design leverages LangGraph’s state management and conditional routing, alongside GPT-4o’s capabilities.

Even if you’re new to AI agents or LangGraph (see resources [1] and [2] for primers), I’ll explain the key concepts clearly. And if you’d like to explore further, the full project is available on GitHub: iason-solomos/Deb8flow.

Ready to see how AI agents can debate autonomously in practice?

Let’s dive in.

High-Level Overview: Autonomous Debates with Multiple Agents

In Deb8flow, we orchestrate a formal debate between two AI agents – one arguing Pro and one Con – complete with a Moderator, a Fact Checker, and a final Judge. The debate unfolds autonomously, with each agent playing a role in a structured format.

At its core, Deb8flow is a LangGraph-powered agent system, built atop LangChain, using GPT-4o to power each role—Pro, Con, Judge, and beyond. We use GPT-4o’s preview model with browsing capabilities to enable real-time fact-checking. In essence, the Pro and Con agents debate; after each statement, a fact-checker agent uses GPT-4o’s web search to catch any hallucinations or inaccuracies in that statement in real time.​ The debate only continues once the statement is verified. The whole process is coordinated by a LangGraph-defined workflow that ensures proper turn-taking and conditional logic.


High-level debate flow graph. Each rectangle is an agent node (Pro/Con debaters, Fact Checker, Judge, etc.), and diamonds are control nodes (Moderator and a router after fact-checking). Solid arrows denote the normal progression, while dashed arrows indicate retries if a claim fails fact-check. The Judge node outputs the final verdict, then the workflow ends.
Image generated by the author with DALL-E

The debate workflow goes through these stages:

  • Topic Generation: A Topic Generator agent produces a nuanced, debatable topic for the session (e.g. “Should AI be used in classroom education?”).
  • Opening: The Pro Argument Agent makes an opening statement in favor of the topic, kicking off the debate.
  • Rebuttal: The Debate Moderator then gives the floor to the Con Argument agent, who rebuts the Pro’s opening statement.
  • Counter: The Moderator gives the floor back to the Pro agent, who counters the Con agent’s points.
  • Closing: The Moderator switches the floor to the Con agent one last time for a closing argument.
  • Judgment: Finally, the Judge agent reviews the full debate history and evaluates both sides based on argument quality, clarity, and persuasiveness. The most convincing side wins.

After every single speech, the Fact Checker agent steps in to verify the factual accuracy of that statement​. If a debater’s claim doesn’t hold up (e.g. cites a wrong statistic or “hallucinates” a fact), the workflow triggers a retry: the speaker has to correct or modify their statement. (If either debater accumulates 3 fact-check failures, they are automatically disqualified for repeatedly spreading inaccuracies, and their opponent wins by default.) This mechanism keeps our AI debaters honest and grounded in reality!

Prerequisites and Setup

Before diving into the code, make sure you have the following in place:

  • Python 3.12+ installed.
  • An OpenAI API key with access to the GPT-4o model. You can create your own API key here: https://platform.openai.com/settings/organization/api-keys
  • Project Code: Clone the Deb8flow repository from GitHub (git clone https://github.com/iason-solomos/Deb8flow.git). The repo includes a requirements.txt for all required packages. Key dependencies include LangChain/LangGraph (for building the agent graph) and the OpenAI Python client.
  • Install Dependencies: In your project directory, run: pip install -r requirements.txt to install the necessary libraries.
  • Create a .env file in the project root to hold your OpenAI API credentials. It should be of the form: OPENAI_API_KEY_GPT4O = "sk-…"
  • You can also at any time check out the README file: https://github.com/iason-solomos/Deb8flow if you simply want to run the finished app.

Once dependencies are installed and the environment variable is set, you should be ready to run the app. The project structure is organized for clarity:

Deb8flow/
├── configurations/
│ ├── debate_constants.py
│ └── llm_config.py
├── nodes/
│ ├── base_component.py
│ ├── topic_generator_node.py
│ ├── pro_debater_node.py
│ ├── con_debater_node.py
│ ├── debate_moderator_node.py
│ ├── fact_checker_node.py
│ ├── fact_check_router_node.py
│ └── judge_node.py
├── prompts/
│ ├── topic_generator_prompts.py
│ ├── pro_debater_prompts.py
│ ├── con_debater_prompts.py
│ └── … (prompts for other agents)
├── tests/ (contains unit and whole workflow tests)
├── debate_state.py
└── debate_workflow.py

A quick tour of this structure:

configurations/ holds constant definitions and LLM configuration classes.

nodes/ contains the implementation of each agent or functional node in the debate (each of these is a module defining one agent’s behavior).

prompts/ stores the prompt templates for the language model (so each agent knows how to prompt GPT-4o for its specific task).

debate_workflow.py ties everything together by defining the LangGraph workflow (the graph of nodes and transitions).

debate_state.py defines the shared data structure that the agents will be using on each run.

tests/ includes some basic tests and example runs to help you verify everything is working.

Under the Hood: State Management and Workflow Setup

To coordinate a complex multi-turn debate, we need a shared state and a well-defined flow. We’ll start by looking at how Deb8flow defines the debate state and constants, and then see how the LangGraph workflow is constructed.

Defining the Debate State Schema (debate_state.py)

Deb8flow uses a shared state (https://langchain-ai.github.io/langgraph/concepts/low_level/#state ) in the form of a Python TypedDict that all agents can read from and update. This state tracks the debate’s progress and context – things like the topic, the history of messages, whose turn it is, etc. By centralizing this information, each agent node can make decisions based on the current state of the debate.

Link: debate_state.py

from typing import TypedDict, List, Dict, Literal


DebateStage = Literal["opening", "rebuttal", "counter", "final_argument"]

class DebateMessage(TypedDict):
    speaker: str  # e.g. pro or con
    content: str  # The message each speaker produced
    validated: bool  # Whether the FactChecker ok’d this message
    stage: DebateStage # The stage of the debate when this message was produced

class DebateState(TypedDict):
    debate_topic: str
    positions: Dict[str, str]
    messages: List[DebateMessage]
    opening_statement_pro_agent: str
    stage: str  # "opening", "rebuttal", "counter", "final_argument"
    speaker: str  # "pro" or "con"
    times_pro_fact_checked: int # The number of times the pro agent has been fact-checked. If it reaches 3, the pro agent is disqualified.
    times_con_fact_checked: int # The number of times the con agent has been fact-checked. If it reaches 3, the con agent is disqualified.

Key fields that we need to have in the DebateState include:

  • debate_topic (str): The topic being debated.
  • messages (List[DebateMessage]): A list of all messages exchanged so far. Each message is a dictionary with fields for speaker (e.g. "pro" or "con" or "fact_checker"), the message content (text), a validated flag (whether it passed fact-check), and the stage of the debate when it was produced.
  • stage (str): The current debate stage (one of "opening", "rebuttal", "counter", "final_argument").
  • speaker (str): Whose turn it is currently ("pro" or "con").
  • times_pro_fact_checked / times_con_fact_checked (int): Counters for how many times each side has been caught with a false claim. (In our rules, if a debater fails fact-check 3 times, they could be disqualified or automatically lose.)
  • positions (Dict[str, str]): (Optional) A mapping of each side’s general stance (e.g., "pro": "In favor of the topic").

By structuring the debate’s state this way, each agent can easily access the conversation history or check the current stage, and the control logic can update the state between turns. The state is essentially the memory of the debate.
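To make this concrete, here is a small illustration (my own, not from the repo) of how a node might read and update the shared state; it assumes the DebateState definition above is importable from debate_state.py:

from debate_state import DebateState  # the TypedDict shown above

# Build an initial state, then pretend the Moderator hands the floor to Con.
state: DebateState = {
    "debate_topic": "Should AI be used in classroom education?",
    "positions": {"pro": "In favor of the topic", "con": "Against the topic"},
    "messages": [],
    "opening_statement_pro_agent": "",
    "stage": "opening",
    "speaker": "pro",
    "times_pro_fact_checked": 0,
    "times_con_fact_checked": 0,
}

# A node only needs to return the keys it wants to change;
# LangGraph merges that partial update into the global state.
update = {"stage": "rebuttal", "speaker": "con"}
state.update(update)
assert state["speaker"] == "con"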

Constants and Configuration

To avoid “magic strings” scattered in the code, we define some constants in debate_constants.py. For example, constants for stage names (STAGE_OPENING = "opening", etc.), speaker identifiers (SPEAKER_PRO = "pro", SPEAKER_CON = "con", etc.), and node names (NODE_PRO_DEBATER = "pro_debater_node", etc.). These make the code easier to maintain and read.

debate_constants.py:

# Stage names
STAGE_OPENING = "opening"
STAGE_REBUTTAL = "rebuttal"
STAGE_COUNTER = "counter"
STAGE_FINAL_ARGUMENT = "final_argument"
STAGE_END = "end"

# Speakers
SPEAKER_PRO = "pro"
SPEAKER_CON = "con"
SPEAKER_JUDGE = "judge"

# Node names
NODE_PRO_DEBATER = "pro_debater_node"
NODE_CON_DEBATER = "con_debater_node"
NODE_DEBATE_MODERATOR = "debate_moderator_node"
NODE_JUDGE = "judge_node"

We also set up LLM configuration in llm_config.py. Here, we define classes for OpenAI or Azure OpenAI configs and then create a dictionary llm_config_map mapping model names to their config. For instance, we map "gpt-4o" to an OpenAILLMConfig that holds the model name and API key. This way, whenever we need to initialize a GPT-4o agent, we can just do llm_config_map["gpt-4o"] to get the right config. All our main agents (debaters, topic generator, judge) use this same GPT-4o configuration.

import os
from dataclasses import dataclass
from typing import Union

@dataclass
class OpenAILLMConfig:
    """
    A data class to store configuration details for OpenAI models.

    Attributes:
        model_name (str): The name of the OpenAI model to use.
        openai_api_key (str): The API key for authenticating with the OpenAI service.
    """
    model_name: str
    openai_api_key: str


llm_config_map = {
    "gpt-4o": OpenAILLMConfig(
        model_name="gpt-4o",
        openai_api_key=os.getenv("OPENAI_API_KEY_GPT4O"),
    )
}

Building the LangGraph Workflow (debate_workflow.py)

With state and configs in place, we construct the debate workflow graph. LangGraph’s StateGraph is the backbone that connects all our agent nodes in the order they should execute. Here’s how we set it up:

class DebateWorkflow:

    def _initialize_workflow(self) -> StateGraph:
        workflow = StateGraph(DebateState)
        # Nodes
        workflow.add_node("generate_topic_node", GenerateTopicNode(llm_config_map["gpt-4o"]))
        workflow.add_node("pro_debater_node", ProDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("con_debater_node", ConDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("fact_check_node", FactCheckNode())
        workflow.add_node("fact_check_router_node", FactCheckRouterNode())
        workflow.add_node("debate_moderator_node", DebateModeratorNode())
        workflow.add_node("judge_node", JudgeNode(llm_config_map["gpt-4o"]))

        # Entry point
        workflow.set_entry_point("generate_topic_node")

        # Flow
        workflow.add_edge("generate_topic_node", "pro_debater_node")
        workflow.add_edge("pro_debater_node", "fact_check_node")
        workflow.add_edge("con_debater_node", "fact_check_node")
        workflow.add_edge("fact_check_node", "fact_check_router_node")
        workflow.add_edge("judge_node", END)
        return workflow



    async def run(self):
        workflow = self._initialize_workflow()
        graph = workflow.compile()
        # graph.get_graph().draw_mermaid_png(output_file_path="workflow_graph.png")
        initial_state = {
            "topic": "",
            "positions": {}
        }
        final_state = await graph.ainvoke(initial_state, config={"recursion_limit": 50})
        return final_state

Let’s break down what’s happening:

  • We initialize a new StateGraph with our DebateState type as the state schema.
  • We add each node (agent) to the graph with a name. For nodes that need an LLM, we pass in the GPT-4o config. For example, "pro_debater_node" is added as ProDebaterNode(llm_config_map["gpt-4o"]), meaning the Pro debater agent will use GPT-4o as its underlying model.
  • We set the entry point of the graph to "generate_topic_node". This means the first step of the workflow is to generate a debate topic.
  • Then we add directed edges to connect nodes. The edges above encode the primary sequence: topic -> pro’s turn -> fact-check -> (then a routing decision) -> … eventually -> judge -> END. We don’t connect the Moderator or Fact Check Router with static edges, since these nodes use dynamic commands to redirect the flow. The final edge connects the judge to an END marker to terminate the graph.

When the workflow runs, control will pass along these edges in order, but whenever we hit a router or moderator node, that node will output a command telling the graph which node to go to next (overriding the default edge). This is how we create conditional loops: the fact_check_router_node might send us back to a debater node for a retry, instead of following a straight line. LangGraph supports this by allowing nodes to return a special Command object with goto instructions.
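For example, a bare-bones routing node could look like the sketch below (the pattern only; Deb8flow’s real Moderator and Fact Check Router appear later in this article, and the node names match the constants defined earlier):

from langgraph.types import Command

def simple_router(state: dict) -> Command:
    # If the last message failed validation, send the flow back to the
    # speaker for a retry; otherwise continue to the moderator.
    last = state["messages"][-1]
    if not last["validated"]:
        retry_node = "pro_debater_node" if last["speaker"] == "pro" else "con_debater_node"
        return Command(goto=retry_node)
    return Command(update={"stage": "rebuttal"}, goto="debate_moderator_node")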

In summary, at a high level we’ve defined an agentic workflow: a graph of autonomous agents where control can branch and loop based on the agents’ outputs. Now, let’s explore what each of these agent nodes actually does.

Agent Nodes Breakdown

Each stage or role in the debate is encapsulated in a node (agent). In LangGraph, nodes are often simple functions, but I wanted a more object-oriented approach for clarity and reusability. So in Deb8flow, every node is a class with a __call__ method. All the main agent classes inherit from a common BaseComponent for shared functionality. This design makes the system modular: we can easily swap out or extend agents by modifying their class definitions, and each agent class is responsible for its piece of the workflow.

Let’s go through the key agents one by one.

BaseComponent – A Reusable Agent Base Class

Most of our agent nodes (like the debaters and judge) share common needs: they use an LLM to generate output, they might need to retry on errors, and they should track token usage. The BaseComponent class (defined in nodes/base_component.py: https://github.com/iason-solomos/Deb8flow/blob/main/nodes/base_component.py) provides these common features so we don’t repeat code.

class BaseComponent:
    """
    A foundational class for managing LLM-based workflows with token tracking.
    Can handle both Azure OpenAI (AzureChatOpenAI) and OpenAI (ChatOpenAI).
    """

    def __init__(
        self,
        llm_config: Optional[LLMConfig] = None,
        temperature: float = 0.0,
        max_retries: int = 5,
    ):
        """
        Initializes the BaseComponent with optional LLM configuration and temperature.

        Args:
            llm_config (Optional[LLMConfig]): Configuration for either Azure or OpenAI.
            temperature (float): Controls the randomness of LLM outputs. Defaults to 0.0.
            max_retries (int): How many times to retry on 429 errors.
        """
        logger = logging.getLogger(self.__class__.__name__)
        tracer = trace.get_tracer(__name__, tracer_provider=get_tracer_provider())

        self.logger = logger
        self.tracer = tracer
        self.llm: Optional[ChatOpenAI] = None
        self.output_parser: Optional[StrOutputParser] = None
        self.state: Optional[DebateState] = None
        self.prompt_template: Optional[ChatPromptTemplate] = None
        self.chain: Optional[RunnableSequence] = None
        self.documents: Optional[List] = None
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.max_retries = max_retries

        if llm_config is not None:
            self.llm = self._init_llm(llm_config, temperature)
            self.output_parser = StrOutputParser()

    def _init_llm(self, config: LLMConfig, temperature: float):
        """
        Initializes an LLM instance for either Azure OpenAI or standard OpenAI.
        """
        if isinstance(config, AzureOpenAILLMConfig):
            # If it's Azure, use the AzureChatOpenAI class
            return AzureChatOpenAI(
                deployment_name=config.deployment_name,
                azure_endpoint=config.azure_endpoint,
                openai_api_version=config.openai_api_version,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        elif isinstance(config, OpenAILLMConfig):
            # If it's standard OpenAI, use the ChatOpenAI class
            return ChatOpenAI(
                model_name=config.model_name,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        else:
            raise ValueError("Unsupported LLMConfig type.")

    def validate_initialization(self) -> None:
        """
        Ensures we have an LLM and an output parser.
        """
        if not self.llm:
            raise ValueError("LLM is not initialized. Ensure `llm_config` is provided.")
        if not self.output_parser:
            raise ValueError("Output parser is not initialized.")

    def execute_chain(self, inputs: Any) -> Any:
        """
        Executes the LLM chain, tracks token usage, and retries on 429 errors.
        """
        if not self.chain:
            raise ValueError("No chain is initialized for execution.")

        retry_wait = 1  # Initial wait time in seconds

        for attempt in range(self.max_retries):
            try:
                with get_openai_callback() as cb:
                    result = self.chain.invoke(inputs)
                    self.logger.info("Prompt Token usage: %s", cb.prompt_tokens)
                    self.logger.info("Completion Token usage: %s", cb.completion_tokens)
                    self.prompt_tokens = cb.prompt_tokens
                    self.completion_tokens = cb.completion_tokens

                return result

            except Exception as e:
                # If the error mentions 429, do exponential backoff and retry
                if "429" in str(e):
                    self.logger.warning(
                        f"Rate limit reached. Retrying in {retry_wait} seconds... "
                        f"(Attempt {attempt + 1}/{self.max_retries})"
                    )
                    time.sleep(retry_wait)
                    retry_wait *= 2
                else:
                    self.logger.error(f"Unexpected error: {str(e)}")
                    raise e

        raise Exception("API request failed after maximum number of retries")

    def create_chain(
        self, system_template: str, human_template: str
    ) -> RunnableSequence:
        """
        Creates a chain for unstructured outputs.
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm | self.output_parser
        return self.chain

    def create_structured_output_chain(
        self, system_template: str, human_template: str, output_model: Type[BaseModel]
    ) -> RunnableSequence:
        """
        Creates a chain that yields structured outputs (parsed into a Pydantic model).
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm.with_structured_output(output_model)
        return self.chain

    def build_return_with_tokens(self, node_specific_data: dict) -> dict:
        """
        Convenience method to add token usage info into the return values.
        """
        return {
            **node_specific_data,
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
        }

    def __call__(self, state: DebateState) -> None:
        """
        Updates the node's local copy of the state.
        """
        self.state = state
        for key, value in state.items():
            setattr(self, key, value)

Key features of BaseComponent:

  • It stores an LLM client (e.g. an OpenAI ChatOpenAI instance) initialized with a given model and API key, as well as an output parser.
  • It provides a method create_chain(system_template, human_template) which sets up a LangChain prompt chain (a RunnableSequence) combining a system prompt and a human prompt. This chain is what actually generates outputs when run.
  • It has an execute_chain(inputs) method that invokes the chain and includes logic to retry if the OpenAI API returns a rate-limit error (HTTP 429). This is done with exponential backoff up to a max_retries count.
  • It keeps track of token usage (prompt tokens and completion tokens) for logging or analysis.
  • The __call__ method of BaseComponent (which each subclass will call via super().__call__(state)) can perform any setup needed before the node’s main logic runs (like ensuring the LLM is initialized).

By building on BaseComponent, each agent class can focus on its unique logic (like what prompt to use and how to handle the state), while inheriting the heavy lifting of interacting with GPT-4o reliably.

Topic Generator Agent (GenerateTopicNode)

The Topic Generator (topic_generator_node.py) is the first agent in the graph. Its job is to come up with a debatable topic for the session. We give it a prompt that instructs it to output a nuanced topic that could reasonably have a pro and con side.

This agent inherits from BaseComponent and uses a prompt chain (system + human prompt) to generate one item of text – the debate topic. When called, it executes the chain (with no special input, just using the prompt) and gets back a topic_text. It then updates the state with:

  • debate_topic: the generated topic (stripped of any extra whitespace),
  • positions: a dictionary assigning the pro and con stances (by default we use "In favor of the topic" and "Against the topic"),
  • stage: set to "opening",
  • speaker: set to "pro" (so the Pro side will speak first).

In code, the return might look like:

return {
    "debate_topic": debate_topic,
    "positions": positions,
    "stage": "opening",
    "speaker": first_speaker  # "pro"
}

Here are the prompts for the topic generator:

SYSTEM_PROMPT = """\
You are a brainstorming AI that suggests debate topics.
You will provide a single, interesting or timely topic that can have two opposing views.
"""

HUMAN_PROMPT = """\
Please suggest one debate topic for two AI agents to discuss.
For example, it could be about technology, politics, philosophy, or any interesting domain.
Just provide the topic in a concise sentence.
"""

Then we pass these prompts in the constructor of the class itself.

class GenerateTopicNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        # Create the prompt chain.
        self.chain: RunnableSequence = self.create_chain(
            system_template=SYSTEM_PROMPT,
            human_template=HUMAN_PROMPT
        )

    def __call__(self, state: DebateState) -> Dict[str, str]:
        """
        Generates a debate topic and assigns positions to the two debaters.
        """
        super().__call__(state)

        topic_text = self.execute_chain({})

        # Store the topic and assign stances in the DebateState
        debate_topic = topic_text.strip()
        positions = {
            "pro": "In favor of the topic",
            "con": "Against the topic"
        }

        
        first_speaker = "pro"
        self.logger.info("Welcome to our debate panel! Today's debate topic is: %s", debate_topic)
        return {
            "debate_topic": debate_topic,
            "positions": positions,
            "stage": "opening",
            "speaker": first_speaker
        }

It’s a pattern we will repeat for all of the LLM-backed agent classes; only the control nodes that don’t use an LLM and the Fact Checker work differently.

Now we can implement the two stars of the show: the Pro and Con argument agents!

Debater Agents (Pro and Con)

Link: pro_debater_node.py

The two debater agents are very similar in structure, but each uses different prompt templates tailored to their role (pro vs con) and the stage of the debate.

The Pro debater, for example, has to handle an opening statement and a counter-argument (countering the Con’s rebuttal). We also need logic for retries in case a statement fails fact-check. In code, the ProDebater class sets up multiple prompt chains:

  • opening_chain and an opening_retry_chain (using slightly different human prompts – the retry prompt might instruct it to try again without repeating any factually dubious claims).
  • counter_chain and counter_retry_chain for the counter-argument stage.

class ProDebaterNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        self.opening_chain = self.create_chain(SYSTEM_PROMPT, OPENING_HUMAN_PROMPT)
        self.opening_retry_chain = self.create_chain(SYSTEM_PROMPT, OPENING_RETRY_HUMAN_PROMPT)
        self.counter_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_HUMAN_PROMPT)
        self.counter_retry_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_RETRY_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)

        debate_topic = state.get("debate_topic")
        messages = state.get("messages", [])
        stage = state.get("stage")
        speaker = state.get("speaker")

        # Check if retrying (last message was by pro and not validated)
        last_msg = messages[-1] if messages else None
        retrying = last_msg and last_msg["speaker"] == SPEAKER_PRO and not last_msg["validated"]

        if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
            chain = self.opening_retry_chain if retrying else self.opening_chain # select which chain we are triggering: the normal one or the retry one after a failed fact-check
            result = chain.invoke({
                "debate_topic": debate_topic
            })
        elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
            opponent_msg = self._get_last_message_by(SPEAKER_CON, messages)
            debate_history = get_debate_history(messages)
            chain = self.counter_retry_chain if retrying else self.counter_chain
            result = chain.invoke({
                "debate_topic": debate_topic,
                "opponent_statement": opponent_msg,
                "debate_history": debate_history
            })
        else:
            raise ValueError(f"Unknown turn for ProDebater: stage={stage}, speaker={speaker}")
        new_message = create_debate_message(speaker=SPEAKER_PRO, content=result, stage=stage)
        self.logger.info("Speaker: %s, Stage: %s, Retry: %s\nMessage:\n%s", speaker, stage, retrying, result)
        return {
            "messages": messages + [new_message]
        }

    def _get_last_message_by(self, speaker_prefix, messages):
        for m in reversed(messages):
            if m.get("speaker") == speaker_prefix:
                return m["content"]
        return ""

When the ProDebater’s __call__ runs, it looks at the current stage and speaker in the state to decide what to do:

  • If it’s the opening stage and the speaker is “pro”, it uses the opening_chain to generate an opening argument. If the last message from Pro was marked invalid (not validated), it knows this is a retry, so it would use the opening_retry_chain instead.
  • If it’s the counter stage and speaker is “pro”, it generates a counter-argument to whatever the opponent (Con) just said. It will fetch the last message by the Con from the messages history, and feed that into the prompt (so that the Pro can directly counter it). Again, if the last Pro message was invalid, it would switch to the retry chain.

After generating its argument, the Debater agent creates a new message entry (with speaker="pro", the content text, validated=False initially, and the stage) and appends it to the state’s message list. That becomes the output of the node (LangGraph will merge this partial state update into the global state).

The Con Debater agent mirrors this logic for its stages:

  • It has a rebuttal and a closing argument (final argument) stage, each with a normal and a retry chain.
  • It checks whether it is the rebuttal or the final argument stage (speaker “con”) and invokes the appropriate chain, possibly using the last Pro message for context when rebutting.
  • It similarly appends its message to the state.

Link: con_debater_node.py

By using class-based implementation, our debaters’ code is easier to maintain. We can clearly separate what the Pro does vs what the Con does, even if they share structure. Also, by encapsulating prompt chains inside the class, each debater can manage multiple possible outputs (regular vs retry) cleanly.

Prompt design: The actual prompts (in prompts/pro_debater_prompts.py and con_debater_prompts.py) guide the GPT-4o model to take on a persona (“You are a debater arguing for/against the topic…”) and produce the argument. They also instruct the model to keep statements factual and logical. If a fact check fails, the retry prompt may say something like: “Your previous statement had an unverified claim. Revise your argument to be factually correct while maintaining your position.” – encouraging the model to correct itself.
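As a rough sketch, the retry prompt for the Pro debater’s opening might look something like this (the real wording lives in prompts/pro_debater_prompts.py; the text below is illustrative, not copied from the repo):

SYSTEM_PROMPT = """\
You are a skilled debater arguing IN FAVOR of the topic.
Be persuasive and structured, and keep every claim factual.
"""

OPENING_RETRY_HUMAN_PROMPT = """\
Debate topic: {debate_topic}

Your previous opening statement contained a claim that could not be verified.
Write a new opening statement that keeps your position and structure,
but removes or replaces any unverified numbers, studies, or facts.
"""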

With this, our AI debaters can engage in a multi-turn duel, and even recover from factual missteps.

Fact Checker Agent (FactCheckNode)

After each debater speaks, the Fact Checker agent swoops in to verify their claims. This agent is implemented in fact_checker_node.py (https://github.com/iason-solomos/Deb8flow/blob/main/nodes/fact_checker_node.py), and interestingly, it uses GPT-4o’s browsing ability rather than our own custom prompts. Essentially, we delegate the fact-checking to OpenAI’s GPT-4o with web search.

How does this work? The OpenAI Python client (using the gpt-4o-search-preview model with browsing) allows us to send a user message and get a structured response back. In FactCheckNode.__call__, we do something like:

completion = self.client.beta.chat.completions.parse(
    model="gpt-4o-search-preview",
    web_search_options={},
    messages=[{
        "role": "user",
        "content": (
            f"Consider the following statement from a debate. "
            f"If the statement contains numbers, or figures from studies, fact-check it online.\n\n"
            f"Statement:\n\"{claim}\"\n\n"
            f"Reply clearly whether any numbers or studies might be inaccurate or hallucinated, and why."
            f"\n"
            f"If the statement doesn't contain references to studies or numbers cited, don't go online to fact-check, and just consider it successfully fact-checked, with a 'yes' score.\n\n"
        )
    }],
    response_format=FactCheck
)

If the result is “yes” (meaning the claim seems truthful or at least not factually wrong), the Fact Checker will mark the last message’s validated field as True in the state, and output {"validated": True} with no further changes. This signals that the debate can continue normally.

If the result is “no” (meaning it found the claim to be incorrect or dubious), the Fact Checker will append a new message to the state with speaker="fact_checker" describing the finding (or we could simply mark it, but providing a brief note like “(Fact Checker: The statistic cited could not be verified.)” can be useful). It will also set validated: False and increment a counter for whichever side made the claim. The output state from this node includes validated: False and an updated times_pro_fact_checked or times_con_fact_checked count.

We also use a Pydantic BaseModel to control the output of the LLM:

class FactCheck(BaseModel):
    """
    Pydantic model for the fact checking the claims made by debaters.

    Attributes:
        binary_score (str): 'yes' if the claim is verifiable and truthful, 'no' otherwise.
    """

    binary_score: str = Field(
        description="Indicates if the claim is verifiable and truthful. 'yes' or 'no'."
    )
    justification: str = Field(
        description="Explanation of the reasoning behind the score."
    )
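Putting the pieces together, the rest of FactCheckNode.__call__ might look roughly like this (a simplified sketch; the field names match the DebateState above, but the exact code in fact_checker_node.py differs):

result: FactCheck = completion.choices[0].message.parsed
messages = state["messages"]
last_message = messages[-1]

if result.binary_score == "yes":
    # The claim holds up: mark the message as validated and let the debate continue.
    last_message["validated"] = True
    return {"messages": messages, "validated": True}

# The claim failed: leave validated=False and bump the offender's counter.
counter_key = (
    "times_pro_fact_checked" if last_message["speaker"] == "pro"
    else "times_con_fact_checked"
)
return {
    "messages": messages,
    "validated": False,
    counter_key: state[counter_key] + 1,
}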

Debate Moderator Agent (DebateModeratorNode)

The Debate Moderator is the conductor of the debate. Instead of producing lengthy text, this agent’s job is to manage turn-taking and stage progression. In the workflow, after a statement is validated by the Fact Checker, control passes to the Moderator node. The Moderator then issues a Command that updates the state for the next turn and directs the flow to the appropriate next agent.

The logic in DebateModeratorNode.__call__ (see nodes/debate_moderator_node.py: https://github.com/iason-solomos/Deb8flow/blob/main/nodes/debate_moderator_node.py) goes roughly like this:

if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_REBUTTAL, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_REBUTTAL and speaker == SPEAKER_CON:
    return Command(
        update={"stage": STAGE_COUNTER, "speaker": SPEAKER_PRO},
        goto=NODE_PRO_DEBATER
    )
elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_FINAL_ARGUMENT, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_FINAL_ARGUMENT and speaker == SPEAKER_CON:
    return Command(
        update={},
        goto=NODE_JUDGE
    )

raise ValueError(f"Unexpected stage/speaker combo: stage={stage}, speaker={speaker}")

Each conditional corresponds to a point in the debate where a turn just ended, and sets up the next turn. For example, after the opening (Pro just spoke), it sets stage to rebuttal, switches speaker to Con, and directs the workflow to the Con debater node​. After the final_argument (Con’s closing), it directs to the Judge with no further update (the debate stage effectively ends).

Fact Check Router (FactCheckRouterNode)

This is another control node (like the Moderator) that introduces conditional logic. The Fact Check Router sits right after the Fact Checker agent in the flow. Its purpose is to branch the workflow depending on the fact-check result.

In nodes/fact_check_router_node.py (https://github.com/iason-solomos/Deb8flow/blob/main/nodes/fact_check_router_node.py), the logic is:

if pro_fact_checks >= 3 or con_fact_checks >= 3:
    disqualified = SPEAKER_PRO if pro_fact_checks >= 3 else SPEAKER_CON
    winner = SPEAKER_CON if disqualified == SPEAKER_PRO else SPEAKER_PRO

    verdict_msg = {
        "speaker": "moderator",
        "content": (
            f"Debate ended early due to excessive factual inaccuracies.\n\n"
            f"DISQUALIFIED: {disqualified.upper()} (exceeded fact check limit)\n"
            f"WINNER: {winner.upper()}"
        ),
        "validated": True,
        "stage": "verdict"
    }
    return Command(
        update={"messages": messages + [verdict_msg]},
        goto=END
    )
if last_message.get("validated"):
    return Command(goto=NODE_DEBATE_MODERATOR)
elif speaker == SPEAKER_PRO:
    return Command(goto=NODE_PRO_DEBATER)
elif speaker == SPEAKER_CON:
    return Command(goto=NODE_CON_DEBATER)
raise ValueError("Unable to determine routing in FactCheckRouterNode.")

First, the Fact Check Router checks if either side’s fact-check count has reached 3. If so, it creates a Moderator-style message announcing an early end: the offending side is disqualified and the other side is the winner​. It appends this verdict to the messages and returns a Command that jumps to END, effectively terminating the debate without going to the Judge (because we already know the outcome).

If we’re not ending the debate early, the router then looks at the Fact Checker’s result for the last message (stored in that message’s validated field). If validated is True, we hand control to the Moderator: Command(goto=NODE_DEBATE_MODERATOR).

Else if the statement fails fact-check, the workflow goes back to the debater to produce a revised statement (with the state counters updated to reflect the failure). This loop can happen multiple times if needed (up to the disqualification limit).

This dynamic control is the heart of Deb8flow’s “agentic” nature – the ability to adapt the path of execution based on the content of the agents’ outputs. It showcases LangGraph’s strength: combining control flow with state. We’re essentially encoding debate rules (like allowing retries for false claims, or ending the debate if someone cheats too often) directly into the workflow graph.

Judge Agent (JudgeNode)

Last but not least, the Judge agent delivers the final verdict based on rhetorical skill, clarity, structure, and overall persuasiveness. Its system prompt and human prompt make this explicit:

  • System Prompt: “You are an impartial debate judge AI. … Evaluate which debater presented their case more clearly, persuasively, and logically. You must focus on communication skills, structure of argument, rhetorical strength, and overall coherence.”
  • Human Prompt: “Here is the full debate transcript. Please analyze the performance of both debaters—PRO and CON. Evaluate rhetorical performance—clarity, structure, persuasion, and relevance—and decide who presented their case more effectively.”

When the Judge node runs, it receives the entire debate transcript (all validated messages) alongside the original topic. It then uses GPT-4o to examine how each side framed their arguments, handled counterpoints, and supported (or failed to support) claims with examples or logic. Crucially, the Judge is forbidden to evaluate which position is objectively correct (or who it thinks might be correct)—only who argued more persuasively.
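In code, the Judge follows the same class pattern as the other LLM-backed agents. Here is a condensed sketch (the prompt constant names are placeholders; get_debate_history and create_debate_message are the same helpers used in the debater code, and the real implementation lives in nodes/judge_node.py):

class JudgeNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.0):
        super().__init__(llm_config, temperature)
        self.chain = self.create_chain(JUDGE_SYSTEM_PROMPT, JUDGE_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> dict:
        super().__call__(state)
        transcript = get_debate_history(state["messages"])
        verdict = self.execute_chain({
            "debate_topic": state["debate_topic"],
            "debate_history": transcript,
        })
        self.logger.info("Final verdict:\n%s", verdict)
        new_message = create_debate_message(speaker="judge", content=verdict, stage="verdict")
        return {"messages": state["messages"] + [new_message]}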

Below is an example final verdict from a Deb8flow run on the topic:
“Should governments implement a universal basic income in response to increasing automation in the workforce?”

WINNER: PRO

REASON: The PRO debater presented a more compelling and rhetorically effective case for universal basic income. Their arguments were well-structured, beginning with a clear statement of the issue and the necessity of UBI in response to automation. They effectively addressed potential counterarguments by highlighting the unprecedented speed and scope of current technological changes, which distinguishes the current situation from past technological shifts. The PRO also provided empirical evidence from UBI pilot programs to counter the CON's claims about work disincentives and economic inefficiencies, reinforcing their argument with real-world examples.

In contrast, the CON debater, while presenting valid concerns about UBI, relied heavily on historical analogies and assumptions about workforce adaptability without adequately addressing the unique challenges posed by modern automation. Their arguments about the fiscal burden and potential inefficiencies of UBI were less supported by specific evidence compared to the PRO's rebuttals.

Overall, the PRO's arguments were more coherent, persuasive, and backed by empirical evidence, making their case more convincing to a neutral observer.

LangSmith Tracing

Throughout Deb8flow’s development, I relied on LangSmith (LangChain’s tracing and observability toolkit) to ensure the entire debate pipeline was behaving correctly. Because we have multiple agents passing control between themselves, it’s easy for unexpected loops or misrouted states to occur. LangSmith provides a convenient way to:

  • Visualize Execution Flow: You can see each agent’s prompt, the tokens consumed (so you can also track costs), and any intermediate states. This makes it much simpler to confirm that, say, the Con Debater is properly referencing the Pro Debater’s last message, or that the Fact Checker is accurately receiving the claim to verify.
  • Debug State Updates: If the Moderator or Fact Check Router is sending the flow to the wrong node, the trace will highlight that mismatch. You can trace which agent was invoked at each step and why, helping you spot stage or speaker misalignments early.
  • Track Prompt and Completion Tokens: With multiple GPT-4o calls, it’s useful to see how many tokens each stage is using, which LangSmith logs automatically if you enable tracing.

Integrating LangSmith is unexpectedly easy. You will just need to provide these 3 keys in your .env file:

  • LANGCHAIN_API_KEY
  • LANGCHAIN_TRACING_V2
  • LANGCHAIN_PROJECT
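For example (the key value and project name below are placeholders):

LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY="your-langsmith-api-key"
LANGCHAIN_PROJECT="deb8flow"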

Then you can open the LangSmith UI to see a structured trace of each run. This greatly reduces the guesswork involved in debugging multi-agent systems and is, in my experience, essential for more complex AI orchestration like ours. Example of a single run:

The trace of one run in waterfall mode in LangSmith, showing how the whole flow ran. Source: generated by the author using LangSmith.

Reflections and Next Steps

Building Deb8flow was an eye-opening exercise in orchestrating autonomous agent workflows. We didn’t just chain a single model call – we created an entire debate simulation with AI agents, each with a specific role, and allowed them to interact according to a set of rules. LangGraph provided a clear framework to define how data and control flows between agents, making the complex sequence manageable in code. By using class-based agents and a shared state, we maintained modularity and clarity, which will pay off for any software engineering project in the long run.

An exciting aspect of this project was seeing emergent behavior. Even though each agent follows a script (a prompt), the unscripted combination – a debater trying to deceive, a fact-checker catching it, the debater rephrasing – felt surprisingly realistic! It’s a small step toward more agentic AI systems that can perform non-trivial multi-step tasks while keeping oversight on one another.

There are plenty of ideas for improvement:

  • User Interaction: Currently it’s fully autonomous, but one could add a mode where a human provides the topic or even takes the role of one side against an AI opponent.
  • We can switch the order in which the debaters speak.
  • We can change the prompts, and with them much of the agents’ behavior, and experiment with different prompt variations.
  • We can have the debaters perform a web search before producing their statements, giving them access to the latest information.

The broader implication of Deb8flow is how it showcases a pattern for composable AI agents. By defining clear boundaries and interactions (just like microservices in software), we can have complex AI-driven processes that remain interpretable and controllable. Each agent is like a cog in a machine, and LangGraph is the gear system making them work in unison.

I found this project energizing, and I hope it inspires you to explore multi-agent workflows. Whether it’s debating, collaborating on writing, or solving problems from different expert angles, the combination of GPT, tools, and structured agentic workflows opens up a new world of possibilities for AI development. Happy hacking!

References

[1] D. Bouchard, “From Basics to Advanced: Exploring LangGraph,” Medium, Nov. 22, 2023. [Online]. Available: https://medium.com/data-science/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787. [Accessed: Apr. 1, 2025].

[2] A. W. T. Ng, “Building a Research Agent that Can Write to Google Docs: Part 1,” Towards Data Science, Jan. 11, 2024. [Online]. Available: https://towardsdatascience.com/building-a-research-agent-that-can-write-to-google-docs-part-1-4b49ea05a292/. [Accessed: Apr. 1, 2025].

The post Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o appeared first on Towards Data Science.

]]>
How to Optimize your Python Program for Slowness https://towardsdatascience.com/how-to-optimize-your-python-program-for-slowness/ Tue, 08 Apr 2025 00:25:35 +0000 https://towardsdatascience.com/?p=605677 Write a short program that finishes after the universe dies

The post How to Optimize your Python Program for Slowness appeared first on Towards Data Science.

]]>

Also available: A Rust version of this article.

Everyone talks about making Python programs faster [1, 2, 3], but what if we pursue the opposite goal? Let’s explore how to make them slower — absurdly slower. Along the way, we’ll examine the nature of computation, the role of memory, and the scale of unimaginably large numbers.

Our guiding challenge: write short Python programs that run for an extraordinarily long time.

To do this, we’ll explore a sequence of rule sets — each one defining what kind of programs we’re allowed to write, by placing constraints on halting, memory, and program state. This sequence isn’t a progression, but a series of shifts in perspective. Each rule set helps reveal something different about how simple code can stretch time.

Here are the rule sets we’ll investigate:

  1. Anything Goes — Infinite Loop
  2. Must Halt, Finite Memory — Nested, Fixed-Range Loops
  3. Infinite, Zero-Initialized Memory — 5-State Turing Machine
  4. Infinite, Zero-Initialized Memory — 6-State Turing Machine (>10↑↑15 steps)
  5. Infinite, Zero-Initialized Memory — Plain Python (compute 10↑↑15 without Turing machine emulation)

Aside: 10↑↑15 is not a typo or a double exponent. It’s a number so large that “exponential” and “astronomical” don’t describe it. We’ll define it in Rule Set 4.

We start with the most permissive rule set. From there, we’ll change the rules step by step to see how different constraints shape what long-running programs look like — and what they can teach us.

Rule Set 1: Anything Goes — Infinite Loop

We begin with the most permissive rules: the program doesn’t need to halt, can use unlimited memory, and can contain arbitrary code.

If our only goal is to run forever, the solution is immediate:

while True:
  pass

This program is short, uses negligible memory, and never finishes. It satisfies the challenge in the most literal way — by doing nothing forever.

Of course, it’s not interesting — it does nothing. But it gives us a baseline: if we remove all constraints, infinite runtime is trivial. In the next rule set, we’ll introduce our first constraint: the program must eventually halt. Let’s see how far we can stretch the running time under that new requirement — using only finite memory.

Rule Set 2: Must Halt, Finite Memory — Nested, Fixed-Range Loops

If we want a program that runs longer than the universe will survive and then halts, it’s easy. Just write two nested loops, each counting over a fixed range from 0 to 10¹⁰⁰−1:

for a in range(10**100):
  for b in range(10**100):
      if b % 10_000_000 == 0:
          print(f"{a:,}, {b:,}")

You can see that this program halts after 10¹⁰⁰ × 10¹⁰⁰ steps. That’s 10²⁰⁰. And — ignoring the print—this program uses only a small amount of memory to hold its two integer loop variables—just 144 bytes.

My desktop computer runs this program at about 14 million steps per second. But suppose it could run at Planck speed (the smallest meaningful unit of time in physics). That would be about 10⁵⁰ steps per year — so 10¹⁵⁰ years to complete.

Current cosmological models estimate the heat death of the universe in 10¹⁰⁰ years, so our program will run about 100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 times longer than the projected lifetime of the universe.
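If you want to sanity-check those numbers, the arithmetic is short enough to do in Python itself:

total_steps = (10**100) ** 2           # the two nested loops: 10**200 steps
planck_steps_per_year = 10**50         # the generous speed assumed above
years_to_finish = total_steps // planck_steps_per_year   # 10**150 years
heat_death_years = 10**100
print(years_to_finish // heat_death_years)               # 1 followed by 50 zeros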

Aside: Practical concerns about running a program beyond the end of the universe are outside the scope of this article.

For an added margin, we can use more memory. Instead of 144 bytes for variables, let’s use 64 gigabytes — about what you’d find in a well-equipped personal computer. That’s about 500 million times more memory, which gives us about one billion variables instead of 2. If each variable iterates over the full 10¹⁰⁰ range, the total number of steps becomes roughly 10¹⁰⁰^(10⁹), or about 10^(100 billion) steps. At Planck speed — roughly 10⁵⁰ steps per year — that corresponds to 10^(100 billion − 50) years of computation.
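One way to picture a billion loop variables is as the digits of a giant odometer. The sketch below (my own illustration, not part of the original program) counts through every combination of num_vars variables, each running from 0 to limit - 1; with num_vars=10**9 and limit=10**100 it would take roughly the 10^(100 billion) steps described above, so run it only with tiny values:

def count_with_odometer(num_vars, limit):
    digits = [0] * num_vars      # zero-initialized "loop variables"
    steps = 0
    while True:
        steps += 1
        i = num_vars - 1
        while i >= 0 and digits[i] == limit - 1:
            digits[i] = 0        # this variable rolled over; carry to the left
            i -= 1
        if i < 0:
            return steps         # every variable rolled over: we're done
        digits[i] += 1

assert count_with_odometer(2, 10) == 10**2   # two variables: 100 steps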


Can we do better? Well, if we allow an unrealistic but interesting rule change, we can do much, much better.

Rule Set 3: Infinite, Zero-Initialized Memory — 5-State Turing Machine

What if we allow infinite memory — so long as it starts out entirely zero?

Aside: Why don’t we allow infinite, arbitrarily initialized memory? Because it trivializes the challenge. For example, you could mark a single byte far out in memory with a 0x01—say, at position 10¹²⁰—and write a tiny program that just scans until it finds it. That program would take an absurdly long time to run — but it wouldn’t be interesting. The slowness is baked into the data, not the code. We’re after something deeper: small programs that generate their own long runtimes from simple, uniform starting conditions.

My first idea was to use the memory to count upward in binary:

0
1
10
11
100
101
110
111
...

We can do that — but how do we know when to stop? If we don’t stop, we’re violating the “must halt” rule. So, what else can we try?

Let’s take inspiration from the father of Computer Science, Alan Turing. We’ll program a simple abstract machine — now known as a Turing machine — under the following constraints:

  • The machine has infinite memory, laid out as a tape that extends endlessly in both directions. Each cell on the tape holds a single bit: 0 or 1.
  • A read/write head moves across the tape. On each step, it reads the current bit, writes a new bit (0 or 1), and moves one cell left or right.
A read/write head positioned on an infinite tape.
  • The machine also has an internal variable called state, which can hold one of n values. For example, with 5 states, we might name the possible values A, B, C, D, and E—plus a special halting state H, which we don’t count among the five. The machine always starts in the first state, A.

We can express a full Turing machine program as a transition table. Here’s an example we’ll walk through step by step.

A 5-state Turing machine transition table.
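For reference, here is the same program written out in text (this is the well-known 5-state busy beaver champion, shown in the same layout we will use for the 6-state machine later):

    A   B   C   D   E
0   1RB 1RC 1RD 1LA 1RH
1   1LC 1RB 0LE 1LD 0LA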
  • Each row corresponds to the current tape value (0 or 1).
  • Each column corresponds to the current state (A through E).
  • Each entry in the table tells the machine what to do next:
    • The first character is the bit to write (0 or 1)
    • The second is the direction to move (L for left, R for right)
    • The third is the next state to enter (A, B, C, D, E, or H, where H is the special halting state).

Now that we’ve defined the machine, let’s see how it behaves over time.

We’ll refer to each moment in time — the full configuration of the machine and tape — as a step. This includes the current tape contents, the head position, and the machine’s internal state (like A, B, or H).

Below is Step 0. The head is pointing to a 0 on the tape, and the machine is in state A.

Looking at row 0, column A in the program table, we find the instruction 1RB. That means:

  • Write 1 to the current tape cell.
  • Move the head Right.
  • Enter state B.

Step 0:

This puts us in Step 1:

The machine is now in state B, pointing at the next tape cell (again 0).

What will happen if we let this Turing machine keep running? It will run for exactly 47,176,870 steps — and then halt. 

Aside: With a Google sign in, you can run this yourself via a Python notebook on Google Colab. Alternatively, you can copy and run the notebook locally on your own computer by downloading it from GitHub.
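If you are curious what such an interpreter boils down to, here is a minimal sketch (my own simplified version, not the notebook’s code; the way the transition table is encoded here is just one convenient choice). It keeps the infinite, zero-initialized tape in a defaultdict and steps until the halting state. As a quick demo it runs the 2-state busy beaver champion, which halts after exactly 6 steps:

from collections import defaultdict

def run_turing_machine(program, max_steps=10**8):
    # program maps (state, bit) -> (write_bit, move, next_state),
    # where move is +1 for right and -1 for left. Returns the step count.
    tape = defaultdict(int)          # infinite, zero-initialized tape
    head, state, steps = 0, "A", 0
    while state != "H" and steps < max_steps:
        write, move, state = program[(state, tape[head])]
        tape[head] = write
        head += move
        steps += 1
    return steps

# The 2-state busy beaver champion, encoded in the style described above.
bb2 = {
    ("A", 0): (1, +1, "B"), ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"), ("B", 1): (1, +1, "H"),
}
assert run_turing_machine(bb2) == 6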

That number 47,176,870 is astonishing on its own, but seeing the full run makes it more tangible. We can visualize the execution using a space-time diagram, where each row shows the tape at a single step, from top (earliest) to bottom (latest). In the image:

  • The first row is blank — it shows the all-zero tape before the machine takes its first step.
  • 1s are shown in orange.
  • 0s are shown in white.
  • Light orange appears where 0s and 1s are so close together they blend.
Space-time diagram for the champion 5-state Turing machine. It runs for 47,176,870 steps before halting. Each row shows the tape at a single step, starting from the top. Orange represents 1, white represents 0.

In 2024, an online group of amateur researchers organized through bbchallenge.org proved that this is the longest-running 5-state Turing machine that eventually halts.


Want to see this Turing machine in motion? You can watch the full 47-million-step execution unfold in this pixel-perfect video:

Or interact with it directly using the Busy Beaver Blaze web app.

The video generator and web app are part of busy-beaver-blaze, the open-source Python & Rust project that accompanies this article.


It’s hard to believe that such a small machine can run 47 million steps and still halt. But it gets even more astonishing: the team at bbchallenge.org found a 6-state machine with a runtime so long it can’t even be written with ordinary exponents.

Rule Set 4: Infinite, Zero-Initialized Memory — 6-State Turing Machine (>10↑↑15 steps)

As of this writing, the longest running (but still halting) 6-state Turing machine known to humankind is:

    A   B   C   D   E   F
0   1RB 1RC 1LC 0LE 1LF 0RC
1   0LD 0RF 1LA 1RH 0RB 0RE

Here is a video showing its first 10 trillion steps:

And here you can run it interactively via a web app.

So, if we are patient — comically patient — how long will this Turing machine run? More than 10↑↑15, where “10 ↑↑ 15” means a tower of exponents fifteen levels tall: 10^(10^(10^(…^10))), with fifteen 10s in the stack.

This is not the same as 10¹⁵ (which is just a regular exponent). Instead:

  • 10¹ = 10
  • 10¹⁰ = 10,000,000,000
  • 10^10^10 is 10¹⁰⁰⁰⁰⁰⁰⁰⁰⁰⁰, already unimaginably large.
  • 10↑↑4 is so large that it vastly exceeds the number of atoms in the observable universe.
  • 10↑↑15 is so large that writing it in exponent notation becomes annoying.

Pavel Kropitz announced this 6-state machine on May 30, 2022. Shawn Ligocki has a great write up explaining both his and Pavel’s discoveries. To prove that these machines run so long and then halt, researchers used a mix of analysis and automated tools. Rather than simulating every step, they identified repeating structures and patterns that could be proven — using formal, machine-verified proofs — to eventually lead to halting.


Up to this point, we’ve been talking about Turing machines — specifically, the longest-known 5- and 6-state machines that eventually halt. We ran the 5-state champion to completion and watched visualizations to explore its behavior. But the discovery that it’s the longest halting machine with 5 states — and the identification of the 6-state contender — came from extensive research and formal proofs, not from running them step-by-step.

That said, the Turing machine interpreter I built in Python can run for millions of steps, and the visualizer written in Rust can handle trillions (see GitHub). But even 10 trillion steps isn’t an atom in a drop of water in the ocean compared to the full runtime of the 6-state machine. And running it that far doesn’t get us any closer to understanding why it runs so long.

Aside: Python and Rust “interpreted” the Turing machines up to some point — reading their transition tables and applying the rules step by step. You could also say they “emulated” them, in that they reproduced their behavior exactly. I avoid the word “simulated”: a simulated elephant isn’t an elephant, but a simulated computer is a computer.

Returning to our central challenge: we want to understand what makes a short program run for a long time. Instead of analyzing these Turing machines, let’s construct a Python program whose 10↑↑15 runtime is clear by design.

Rule Set 5: Infinite, Zero-Initialized Memory — Plain Python (compute 10↑↑15 without Turing machine emulation)

Our challenge is to write a small Python program that runs for at least 10↑↑15 steps, using any amount of zero-initialized memory.

To achieve this, we’ll compute the value of 10↑↑15 in a way that guarantees the program takes at least that many steps. The ↑↑ operator is called tetration—recall from Rule Set 4 that ↑↑ stacks exponents: for example, 10↑↑3 means 10^(10^10). It’s an extremely fast-growing function. We will program it from the ground up.

Rather than rely on built-in operators, we’ll define tetration from first principles:

  • Tetration, implemented by the function tetrate, as repeated exponentiation
  • Exponentiation, via exponentiate, as repeated multiplication
  • Multiplication, via multiply, as repeated addition
  • Addition, via add, as repeated increment

Each layer builds on the one below it, using only zero-initialized memory and in-place updates.

We’ll begin at the foundation — with the simplest operation of all: increment.

Increment

Here’s our definition of increment and an example of its use:

from gmpy2 import xmpz

def increment(acc_increment):
  assert is_valid_accumulator(acc_increment), "not a valid accumulator"
  acc_increment += 1

def is_valid_accumulator(acc):
  return isinstance(acc, xmpz) and acc >= 0  

b = xmpz(4)
print(f"++{b} = ", end="")
increment(b)
print(b)
assert b == 5

Output:

++4 = 5

We’re using xmpz, a mutable arbitrary-precision integer type provided by the gmpy2 library. It behaves like Python’s built-in int in terms of numeric range—limited only by memory—but unlike int, it supports in-place updates.

To stay true to the spirit of a Turing machine and to keep the logic minimal and observable, we restrict ourselves to just a few operations:

  • Creating an integer with value 0 (xmpz(0))
  • In-place increment (+= 1) and decrement (-= 1)
  • Comparing with zero

All arithmetic is done in-place, with no copies and no temporary values. Each function in our computation chain modifies an accumulator directly. Most functions also take an input value a, but increment—being the most basic—does not. We use descriptive names like acc_increment, add_acc, and so on to make the operation clear and to support later functions where multiple accumulators will appear together.

Aside: Why not use Python’s built-in int type? It supports arbitrary precision and can grow as large as your memory allows. But it’s also immutable, meaning any update like += 1 creates a new integer object. Even if you think you’re modifying a large number in place, Python is actually copying all of its internal memory—no matter how big it is.
For example:

x = 10**100
y = x
x += 1
assert x == 10**100 + 1 and y == 10**100

Even though x and y start out identical, x += 1 creates a new object—leaving y unchanged. This behavior is fine for small numbers, but it violates our rules about memory use and in-place updates. That’s why we use gmpy2.xmpz, a mutable arbitrary-precision integer that truly supports efficient, in-place changes.
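
As a quick check of my own (assuming gmpy2 is installed, as above), we can compare object identity before and after += 1 for both types:

from gmpy2 import xmpz

n = 10**100
n_before = n           # keep a second reference so the original object stays alive
n += 1
print(n is n_before)   # False: += allocated a brand-new int object

m = xmpz(10**100)
m_before = m
m += 1
print(m is m_before)   # True: the same xmpz object was updated in place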

Addition

With increment defined, we next define addition as repeated incrementing.

def add(a, add_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(add_acc), "not a valid accumulator"
  for _ in range(a):
      add_acc += 1

def is_valid_other(a):
  return isinstance(a, int) and a >= 0      

a = 2
b = xmpz(4)
print(f"Before: id(b) = {id(b)}")
print(f"{a} + {b} = ", end="")
add(a, b)
print(b)
print(f"After:  id(b) = {id(b)}")  # ← compare object IDs
assert b == 6

Output:

Before: id(b) = 2082778466064
2 + 4 = 6
After:  id(b) = 2082778466064

The function adds a to add_acc by incrementing add_acc one step at a time, a times. The before and after ids are the same, showing that no new object was created—add_acc was truly updated in place.

Aside: You might wonder why add doesn’t just call our increment function. We could write it that way—but we’re deliberately inlining each level by hand. This keeps all loops visible, makes control flow explicit, and helps us reason precisely about how much work each function performs.

Even though gmpy2.xmpz supports direct addition, we don’t use it. We’re working at the most primitive level possible—incrementing by 1—to keep the logic simple, intentionally slow, and to make the amount of work explicit.

As with increment_acc, we update add_acc in place, with no copying or temporary values. The only operation we use is += 1, repeated a times.

Next, we define multiplication.

Multiplication

With addition in place, we can now define multiplication as repeated addition. Here’s the function and example usage. Unlike add and increment, this one builds up a new xmpz value from zero and returns it.

def multiply(a, multiply_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(multiply_acc), "not a valid accumulator"

  add_acc = xmpz(0)
  for _ in count_down(multiply_acc):
      for _ in range(a):
          add_acc += 1
  return add_acc

def count_down(acc):
  assert is_valid_accumulator(acc), "not a valid accumulator"
  while acc > 0:
      acc -= 1
      yield

a = 2
b = xmpz(4)
print(f"{a} * {b} = ", end="")
c = multiply(a, b)
print(c)
assert c == 8
assert b == 0

Output:

2 * 4 = 8

This multiplies a by the value of multiply_acc by adding a to add_acc once for every time multiply_acc can be decremented. The result is returned and then assigned to c. The original multiply_acc is decremented to zero and consumed in the process.

You might wonder what this line does:

for _ in count_down(multiply_acc):

While xmpz technically works with range(), doing so converts it to a standard Python int, which is immutable. That triggers a full copy of its internal memory—an expensive operation for large values. Worse, each decrement step would involve allocating a new integer and copying all previous bits, so what should be a linear loop ends up doing quadratic total work. Our custom count_down() avoids all that by decrementing in place, yielding control without copying, and maintaining predictable memory use.
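
As a quick illustration of my own (reusing the count_down generator above), here is what it means for an accumulator to be consumed:

b = xmpz(3)
for _ in count_down(b):
    pass
print(b)   # 0: count_down decremented the accumulator in place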

We’ve built multiplication from repeated addition. Now it’s time to go a layer further: exponentiation.

Exponentiation

We define exponentiation as repeated multiplication. As before, we perform all work using only incrementing, decrementing, and in-place memory. As with multiply, the final result is returned while the input accumulator is consumed.

Here’s the function and example usage:

def exponentiate(a, exponentiate_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(exponentiate_acc), "not a valid accumulator"
  assert a > 0 or exponentiate_acc != 0, "0^0 is undefined"

  multiply_acc = xmpz(0)
  multiply_acc += 1
  for _ in count_down(exponentiate_acc):
      add_acc = xmpz(0)
      for _ in count_down(multiply_acc):
          for _ in range(a):
              add_acc += 1
      multiply_acc = add_acc
  return multiply_acc


a = 2
b = xmpz(4)
print(f"{a}^{b} = ", end="")
c = exponentiate(a, b)
print(c)
assert c == 16
assert b == 0

Output:

2^4 = 16

This raises a to the power of exponentiate_acc, using only incrementing, decrementing, and loop control. We initialize multiply_acc to 1 with a single increment—because repeatedly multiplying from zero would get us nowhere. Then, for each time exponentiate_acc can be decremented, we multiply the current result (multiply_acc) by a. As with the earlier layers, we inline the multiply logic directly instead of calling the multiply function—so the control flow and step count stay fully visible.

Aside: And how many times is += 1 called? Obviously at least 2⁴ times—because our result is 2⁴, and we reach it by incrementing from zero. More precisely, the number of increments is:

• 1 increment — initializing multiply_acc to one

Then we loop four times, and in each loop, we multiply the current value of multiply_acc by a = 2, using repeated addition:

• 2 increments — for multiply_acc = 1, add 2 once
• 4 increments — for multiply_acc = 2, add 2 twice
• 8 increments — for multiply_acc = 4, add 2 four times
• 16 increments — for multiply_acc = 8, add 2 eight times

That’s a total of 1 + 2 + 4 + 8 + 16 = 31 increments, which is 2⁵-1. In general, the number of calls to increment will be exponential, but the number is not the same exponential that we are computing.
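
If you'd like to check that count yourself, here is a small instrumented sketch of my own (reusing the count_down generator from the multiplication section) that tallies the += 1 calls on the accumulators while computing 2⁴ with the same inlined loops:

def exponentiate_counting(a, exponentiate_acc):
    increments = 0
    multiply_acc = xmpz(0)
    multiply_acc += 1              # the single initializing increment
    increments += 1
    for _ in count_down(exponentiate_acc):
        add_acc = xmpz(0)
        for _ in count_down(multiply_acc):
            for _ in range(a):
                add_acc += 1       # every accumulator increment is tallied
                increments += 1
        multiply_acc = add_acc
    return multiply_acc, increments

result, increments = exponentiate_counting(2, xmpz(4))
print(result, increments)          # expected: 16 31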

With exponentiation defined, we’re ready for the top of our tower: tetration.

Tetration

Here’s the function and example usage:

def tetrate(a, tetrate_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(tetrate_acc), "not a valid accumulator"
  assert a > 0, "we don't define 0↑↑b"

  exponentiate_acc = xmpz(0)
  exponentiate_acc += 1
  for _ in count_down(tetrate_acc):
      multiply_acc = xmpz(0)
      multiply_acc += 1
      for _ in count_down(exponentiate_acc):
          add_acc = xmpz(0)
          for _ in count_down(multiply_acc):
              for _ in range(a):
                  add_acc += 1
          multiply_acc = add_acc
      exponentiate_acc = multiply_acc
  return exponentiate_acc


a = 2
b = xmpz(3)
print(f"{a}↑↑{b} = ", end="")
c = tetrate(a, b)
print(c)
assert c == 16  # 2^(2^2)
assert b == 0   # Confirm tetrate_acc is consumed

Output:

2↑↑3 = 16

This computes a ↑↑ tetrate_acc, meaning it exponentiates a by itself repeatedly, tetrate_acc times.

For each decrement of tetrate_acc, we exponentiate the current value. We in-line the entire exponentiate and multiply logic again, all the way down to repeated increments.

As expected, this computes 2^(2^2) = 16. With a Google sign-in, you can run this yourself via a Python notebook on Google Colab. Alternatively, you can copy the notebook from GitHub and then run it on your own computer.
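
Before scaling up, here is a quick sanity check of my own (it relies on the tetrate function defined above) that compares it against direct evaluation with ** for inputs still small enough to compute:

assert tetrate(2, xmpz(2)) == 2**2             # 2↑↑2 = 4
assert tetrate(2, xmpz(3)) == 2**(2**2)        # 2↑↑3 = 16
assert tetrate(2, xmpz(4)) == 2**(2**(2**2))   # 2↑↑4 = 65,536
assert tetrate(3, xmpz(2)) == 3**3             # 3↑↑2 = 27
print("tetrate agrees with ** on small inputs")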

We can also run tetrate on 10↑↑15. It will start running, but it won’t stop during our lifetimes — or even the lifetime of the universe:

a = 10
b = xmpz(15)
print(f"{a}↑↑{b} = ", end="")
c = tetrate(a, b)
print(c)

Let’s compare this tetrate function to what we found in the previous Rule Sets.

Rule Set 1: Anything Goes — Infinite Loop

Recall our first function:

while True:
  pass

Unlike this infinite loop, our tetrate function eventually halts — though not anytime soon.

Rule Set 2: Must Halt, Finite Memory — Nested, Fixed-Range Loops

Recall our second function:

for a in range(10**100):
  for b in range(10**100):
      if b % 10_000_000 == 0:
          print(f"{a:,}, {b:,}")

Both this function and our tetrate function contain a fixed number of nested loops. But tetrate differs in an important way: the number of loop iterations grows with the input value. In the function above, each loop runs from 0 to 10¹⁰⁰-1, a hardcoded bound. In tetrate, the loop bounds are dynamic: they grow explosively with each layer of computation.

Rule Sets 3 & 4: Infinite, Zero-Initialized Memory — 5- and 6-State Turing Machines

Compared to the Turing machines, our tetrate function has a clear advantage: we can directly see that it will call += 1 more than 10↑↑15 times. Even better, we can also see — by construction — that it halts.

What the Turing machines offer instead is a simpler, more universal model of computation — and perhaps a more principled definition of what counts as a “small program.”

Conclusion

So, there you have it — a journey through writing absurdly slow programs. Along the way, we explored the outer edges of computation, memory, and performance, using everything from deeply nested loops to Turing machines to a hand-inlined tetration function.

Here’s what surprised me:

  • Nested loops are enough.
    If you just want a short program that halts after outliving the universe, two nested loops with 144 bytes of memory will do the job. I hadn’t realized it was that simple.
  • Turing machines escalate fast.
    The jump from 5 to 6 states unleashes a dramatic leap in complexity and runtime. Also, the importance of starting with zero-initialized memory is obvious in retrospect — but it wasn’t something I’d considered before.
  • Python’s int type can kill performance.
    Yes, Python integers are arbitrary precision, which is great. But they’re also immutable. That means every time you do something like x += 1, Python silently allocates a brand-new integer object—copying all the memory of x, no matter how big it is. It feels in-place, but it’s not. This behavior turns efficient-looking code into a performance trap when working with large values. To get around this, we use the gmpy2.xmpz type—a mutable, arbitrary-precision integer that allows true in-place updates.
  • There’s something beyond exponentiation — and it’s called tetration.
    I didn’t know this. I wasn’t familiar with the ↑↑ notation or the idea that exponentiation could itself be iterated to form something even faster-growing. It was surprising to learn how compactly it can express numbers that are otherwise unthinkably large.
    And because I know you’re asking — yes, there’s something beyond tetration too. It’s called pentation, then hexation, and so on. These are part of a whole hierarchy known as hyperoperations. There’s even a metageneralization: systems like the Ackermann function and fast-growing hierarchies capture entire families of these functions and more.
  • Writing tetration with explicit loops was eye-opening.
    I already knew that exponentiation is repeated multiplication, and so on. I also knew this could be written recursively. What I hadn’t seen was how cleanly it could be written as nested loops, without copying values and with strict in-place updates.

Thank you for joining me on this journey. I hope you now have a clearer understanding of how small Python programs can run for an astonishingly long time — and what that reveals about computation, memory, and minimal systems. We’ve seen programs that halt only after the universe dies, and others that run even longer.

Please follow Carl on Towards Data Science and on @carlkadie.bsky.social. I write on scientific programming in Python and Rust, machine learning, and statistics. I tend to write about one article per month.

The post How to Optimize your Python Program for Slowness appeared first on Towards Data Science.

]]>
Let’s Call a Spade a Spade: RDF and LPG — Cousins Who Should Learn to Live Together https://towardsdatascience.com/lets-call-a-spade-a-spade-rdf-and-lpg-cousins-who-should-learn-to-live-together/ Mon, 07 Apr 2025 23:28:25 +0000 https://towardsdatascience.com/?p=605674 An objective comparison of the RDF and LPG data models 

The post Let’s Call a Spade a Spade: RDF and LPG — Cousins Who Should Learn to Live Together appeared first on Towards Data Science.

]]>
In recent years, there has been a proliferation of articles, LinkedIn posts, and marketing materials presenting graph data models from different perspectives. This article will refrain from discussing specific products and instead focus solely on the comparison of RDF (Resource Description Framework) and LPG (Labelled Property Graph) data models. To clarify, there is no mutually exclusive choice between RDF and LPG — they can be employed in conjunction. The appropriate choice depends on the specific use case, and in some instances both models may be necessary; there is no single data model that is universally applicable. In fact, polyglot persistence and multi-model databases (databases that can support different data models within the database engine or on top of the engine) are gaining popularity as enterprises recognise the importance of storing data in diverse formats to maximise its value and prevent stagnation. For instance, storing time series financial data in a graph model is not the most efficient approach, as it could result in minimal value extraction compared to storing it in a time series matrix database, which enables rapid and multi-dimensional analytical queries.

The purpose of this discussion is to provide a comprehensive comparison of the RDF and LPG data models, highlighting their distinct purposes and overlapping usage. While articles often present biased evaluations promoting their own tools, it is essential to acknowledge that these comparisons are often flawed, as they compare apples to wheelbarrows rather than apples to apples. This subjectivity can leave readers perplexed and uncertain about the author’s intended message. In contrast, this article aims to provide an objective analysis, focusing on the strengths and weaknesses of both the RDF and LPG data models, rather than acting as promotional material for any tool.

Quick recap of the data models

Both RDF and LPG are descendants of the graph data model, although they possess different structures and characteristics. A graph comprises vertices (nodes) and edges that connect two vertices. Various graph types exist, including undirected graphs, directed graphs, multigraphs, hypergraphs, and so on. The RDF and LPG data models adopt the directed multigraph approach, wherein edges have a "from" and "to" ordering, and an arbitrary number of distinct edges can connect the same pair of vertices.

The RDF data model represents data as a set of triples reflecting the natural language structure of subject-verb-object. Consider the following simple example: Jeremy was born in Birkirkara. This sentence can be represented as an RDF statement, or fact, with the following structure: Jeremy is the subject resource, born in is the predicate (relation), and Birkirkara is the object value. The object could either be a URI (Uniform Resource Identifier) or a datatype value (such as an integer or string). If the object is a semantic URI, or as they are also known, a resource, then the object would lead to other facts, such as Birkirkara townIn Malta. This data model allows resources to be reused and interlinked in the same RDF-based graph, or in any other RDF graph, internal or external. Once a resource is defined and a URI is "minted", this URI becomes instantly available and can be used in any context that is deemed necessary.

On the other hand, the LPG data model encapsulates the set of vertices, edges, label assignment functions for vertices and edges, and a key-value property assignment function for vertices and edges. For the previous example, the representation would be as follows:


(person:Person {name: "Jeremy"})

(city:City {name: "Birkirkara"})

(person)-[:BORN_IN]->(city)

Consequently, the primary distinction between RDF and LPG lies in how nodes are connected together. In the RDF model, relationships are triples in which predicates define the connection. In the LPG data model, edges are first-class citizens with their own properties. Therefore, in the RDF data model, predicates are globally defined in a schema and are reused across data graphs, whilst in the LPG data model, each edge is uniquely identified.
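
To make the RDF side concrete, here is a minimal sketch using the rdflib Python library (my own illustration, assuming rdflib is installed; the example.org namespace and property names are invented for this example):

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Jeremy born in Birkirkara, and Birkirkara is a town in Malta.
g.add((EX.Jeremy, EX.bornIn, EX.Birkirkara))
g.add((EX.Birkirkara, EX.townIn, EX.Malta))

# Because Birkirkara is itself a resource (a URI), the two facts interlink naturally.
for s, p, o in g.triples((None, EX.townIn, None)):
    print(s, p, o)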

Schema vs Schema-less. Do semantics matter at all?

Semantics is a branch of linguistics and logic concerned with meaning, in this case the meaning of data, enabling both humans and machines to interpret the context of the data and any relationships within that context.

Historically, the World Wide Web Consortium (W3C) established the Resource Description Framework (RDF) data model as a standardised framework for data exchange within the Web. RDF facilitates seamless data integration and the merging of diverse sources, while simultaneously supporting schema evolution without necessitating modifications to data consumers. Schemas1, or ontologies, serve as the foundation for data represented in RDF, and through these ontologies the semantic meaning of the data can be defined. This capability makes data integration one of the numerous suitable applications of the RDF data model. Through various W3C groups, standards were established on how schemas and ontologies can be defined, primarily RDF Schema (RDFS), Web Ontology Language (OWL), and recently SHACL. RDFS provides the low—level constructs for defining ontologies, such as the Person entity with properties name, gender, knows, and the expected type of node. OWL provides constructs and mechanisms for formally defining ontologies through axioms and rules, enabling the inference of implicit data. Whilst OWL axioms are taken as part of the knowledge graph and used to infer additional facts, SHACL was introduced as a schema to validate constraints, better known as data shapes (consider it as “what should a Person consist of?”) against the knowledge graph. Moreover, through additional features to the SHACL specifications, rules and inference axioms can also be defined using SHACL.

In summary, schemas facilitate the enforcement of the right instance data. This is possible because RDF permits any value to be defined within a fact, provided it adheres to the specifications. Validators, such as in-built SHACL engines or OWL constructs, are responsible for verifying the data’s integrity. Given that these validators are standardised, all triple stores (those adhering to the RDF data model) are encouraged to implement them. However, this does not negate the concept of flexibility. The RDF data model is designed to accommodate the growth, extension, and evolution of data within the schema’s boundaries. Consequently, while the RDF data model strongly encourages the use of schemas (or ontologies) as its foundation, experts discourage the creation of ivory tower ontologies. This endeavour does require an upfront effort and collaboration with domain experts to construct an ontology that accurately reflects the use case and the data that will be stored in the knowledge graph. Nonetheless, the RDF data model offers the flexibility to create and define RDF-based data independently of a pre-existing ontology, or to develop an ontology iteratively throughout a data project. Furthermore, schemas are designed for reuse, and the RDF data model facilitates this reusability. It is noteworthy that an RDF-based knowledge graph typically encompasses both instance data (such as "Giulia and Matteo are siblings") and ontology/schema axioms (such as "Two people are siblings when they have a parent in common").

Nonetheless, the significance of ontologies extends beyond providing a data structure; they also impart semantic meaning to the data. For instance, in constructing a family tree, an ontology enables the explicit definition of relationships such as aunt, uncle, cousins, niece, nephew, ancestors, and descendants without the need for the explicit data to be defined in the knowledge graph. Consider how this concept can be applied in various pharmaceutical scenarios, just to mention one vertical domain. Reasoning is a fundamental component that renders the RDF data model a semantically powerful model for designing knowledge graphs. Ontologies provide a particular data point with all the necessary context, including its neighbourhood and its meaning. For instance, if there is a literal node with the value 37, an RDF—based agent can comprehend that the value 37 represents the age of a person named Jeremy, who is the nephew of a person named Peter.

In contrast, the LPG data model offers a more agile and straightforward deployment of graph data. LPGs have reduced focus on schemas (they only support some constraints and “labels”/classes). Graph databases adhering to the LPG data model are known for their speed in preparing data for consumption due to its schema—less nature. This makes them a more suitable choice for data architects seeking to deploy their data in such a manner. The LPG data model is particularly advantageous in scenarios where data is not intended for growth or significant changes. For instance, a modification to a property would necessitate refactoring the graph to update nodes with the newly added or updated key—value property. While LPG provides the illusion of providing semantics through node and edge labels and corresponding functions, it does not inherently do so. LPG functions consistently return a map of values associated with a node or edge. Nonetheless, this is fundamental when dealing with use cases that need to perform fast graph algorithms as the data is available directly in the nodes and edges, and there is no need for further graph traversal.

However, one fundamental feature of the LPG data model is its ease and flexibility of attaching granular attributes or properties to either vertices or edges. For instance, if there are two person nodes, “Alice” and “Bob,” with an edge labelled “marriedTo,” the LPG data model can accurately and easily state that Alice and Bob were married on February 29, 2024. In contrast, the RDF data model could achieve this through various workarounds, such as reification, but this would result in more complex queries compared to the LPG data model’s counterpart.

Standards, Standardisation Bodies, Interoperability.

In the previous section we described how the W3C provides standardisation groups pertaining to the RDF data model. For instance, a W3C working group is actively developing the RDF* standard, which incorporates the complex relationship concept (attaching attributes to facts/triples) within the RDF data model. This standard is anticipated to be adopted and supported by all triple store tools and agents based on the RDF data model. However, the process of standardisation can be protracted, frequently resulting in delays that leave such vendors at a disadvantage.

Nonetheless, standards facilitate much-needed interoperability. Knowledge graphs built upon the RDF data model can be easily ported between different applications and triple stores, as there is no vendor lock-in and standardised formats are provided. Similarly, they can be queried with one standard query language called SPARQL, which is used by the different vendors. Whilst the query language is the same, vendors opt for different query execution plans, equivalent to how any database engine (SQL or NoSQL) is implemented, to enhance performance and speed.

Most LPG graph implementations, although open source, utilise proprietary or custom languages for storing and querying data, without adherence to any standard. This practice decreases the interoperability and portability of data between different vendors. However, in recent months, ISO approved and published ISO/IEC 39075:2024, which standardises the Graph Query Language (GQL) based on Cypher. As the charter rightly points out, the graph data model has unique advantages over relational databases, such as fitting data that is meant to have hierarchical, complex, or arbitrary structures. Nevertheless, the proliferation of vendor-specific implementations overlooks a crucial functionality: a standardised approach to querying property graphs. Therefore, it is paramount that property graph vendors align their products with this standard.

Recently, OneGraph2 was proposed as an interoperable metamodel intended to overcome the choice between the RDF data model and the LPG data model. Furthermore, extensions to openCypher have been proposed3 as a way of querying over RDF data. This vision aims to pave the way for having data in both RDF and LPG combined in a single, integrated database, ensuring the benefits of both data models.

Other notable differences

There are notable differences, mostly in the query languages that support the two data models. However, we strongly argue that a set of query language features should not dictate which data model to use. Nonetheless, we will discuss some of the differences here for a more complete overview.

The RDF data model offers a natural way of supporting global unique resource identifiers (URIs), which manifest in three distinct characteristics. Within the RDF domain, a set of facts described by an RDF statement (i.e. s, p, o) having the same subject URI is referred to as a resource. Data stored in RDF graphs can be conveniently split into multiple named graphs, ensuring that each graph encapsulates distinct concerns. For instance, using the RDF data model it is straightforward to construct graphs that store data or resources, metadata, audit and provenance data separately, whilst interlinking and querying capabilities can be seamlessly executed across these multiple graphs. Furthermore, graphs can establish interlinks with resources located in graphs hosted on different servers. Querying these external resources is facilitated through query federation within the SPARQL protocol. Given the adoption of URIs, RDF embodies the original vision of Linked Data4, a vision that has since been adopted, to an extent, as a guiding principle in the FAIR principles5, Data Fabric, Data Mesh, and HATEOAS amongst others. Consequently, the RDF data model serves as a versatile framework that can seamlessly integrate with these visions without the need for any modifications.

LPGs, on the other hand, are better geared towards path traversal queries, graph analytics, and variable-length path queries. Whilst these functionalities can be considered specific implementations in the query language, they are pertinent considerations when modelling data in a graph, since these are also benefits over traditional relational databases. SPARQL, through the W3C recommendation, has limited support for path traversal6, and some vendor triple store implementations do support variable-length paths7 (although not as part of the SPARQL 1.1 recommendation). At the time of writing, the SPARQL 1.2 recommendation will not incorporate this feature either.

Data Graph Patterns

The following section describes various data graph patterns and how they would fit, or not, both data models discussed in this article.

  • Global Definition of relations/properties
    RDF data model: Through schemas, properties are globally defined with various semantic properties such as domains and ranges, algebraic properties such as inverse-of, reflexive, and transitive, and informative annotations on property definitions.
    LPG data model: Semantics of relations (edges) is not supported in property graphs.
  • Multiple Languages
    RDF data model: String data can have a language tag attached to it, which is considered when processing.
    LPG data model: Can be a custom field or relationship (e.g. label_en, label_mt), but with no special treatment.
  • Taxonomy – Hierarchy
    RDF data model: Automatic inferencing and reasoning; can handle complex classes.
    LPG data model: Can model hierarchies, but not hierarchies of classes of individuals; would require explicit traversal of classification hierarchies.
  • Individual Relationships
    RDF data model: Requires workarounds like reification and complex queries.
    LPG data model: Can make direct assertions over them; natural representation and efficient querying.
  • Property Inheritance
    RDF data model: Properties are inherited through defined class hierarchies; furthermore, the RDF data model can represent subproperties.
    LPG data model: Must be handled in application logic.
  • N-ary Relations
    RDF data model: Generally, binary relationships are represented in triples, but n-ary relations can be modelled via blank nodes, additional resources, or reification.
    LPG data model: Can often be translated to additional attributes on edges.
  • Property Constraints and Validation
    RDF data model: Available through schema definitions: RDFS, OWL, or SHACL.
    LPG data model: Supports minimal constraints such as value uniqueness, but generally requires validation through schema layers or application logic.
  • Context and Provenance
    RDF data model: Can be done in various ways, including a separate named graph with links to the main resources, or through reification.
    LPG data model: Can add properties to nodes and edges to capture context and provenance.
  • Inferencing
    RDF data model: Automates the inferencing of inverse relationships, transitive patterns, complex property chains, disjointness, and negation.
    LPG data model: Either requires explicit definition in application logic, or has no support at all (disjointness and negation).

Semantics in Graphs — A Family Tree Example

A comprehensive exploration of the application of RDF data model and semantics within an LPG application can be found in various articles published on Medium, LinkedIn, and other blogs. As outlined in the previous section, the LPG data model is not specifically designed for reasoning purposes. Reasoning involves applying logical rules on existing facts as a way to deduce new knowledge; this is important as it helps uncover hidden relationships that were not explicitly stated before. 

In this section we will demonstrate how axioms are defined for a simple yet practical example of a family tree. A family tree is an ideal candidate for any graph database due to its hierarchical structure and its flexibility in being defined within any data model. For this demonstration, we will model the Pewterschmidt family, which is a fictional family from the popular animated television series Family Guy.

All images, unless otherwise noted, are by the author.

In this case, we are just creating one relationship called ‘hasChild’. So, Carter has a child named Lois, and so on. The only other attribute we’re adding is the gender (Male/Female). For the RDF data model, we have created a simple OWL ontology:

[Image: the simple OWL ontology created for the family tree]

The current schema enables us to represent the family tree in an RDF data model. With ontologies, we can then start defining additional properties whose values can be deduced from the initial data. We introduce the following properties:

  • isAncestorOf
    Comment: A transitive property, and the inverse of the isDescendentOf property. OWL engines automatically infer transitive properties without the need for rules.
    Axiom: hasChild(?x, ?y) -> isAncestorOf(?x, ?y)
    Example: Carter -isAncestorOf-> Lois -isAncestorOf-> Chris, hence Carter -isAncestorOf-> Chris
  • isDescendentOf
    Comment: A transitive property, the inverse of isAncestorOf. OWL engines automatically infer inverse properties without the need for rules.
    Example: Chris -isDescendentOf-> Peter
  • isBrotherOf
    Comment: A subproperty of isSiblingOf and disjoint with isSisterOf, meaning that the same person cannot be the brother and the sister of another person at the same time, and cannot be the brother of themselves.
    Axiom: hasChild(?x, ?y), hasChild(?x, ?z), hasGender(?y, Male), notEqual(?y, ?z) -> isBrotherOf(?y, ?z)
    Example: Chris -isBrotherOf-> Meg
  • isSisterOf
    Comment: A subproperty of isSiblingOf and disjoint with isBrotherOf, meaning that the same person cannot be the brother and the sister of another person at the same time, and cannot be the sister of themselves.
    Axiom: hasChild(?x, ?y), hasChild(?x, ?z), hasGender(?y, Female), notEqual(?y, ?z) -> isSisterOf(?y, ?z)
    Example: Meg -isSisterOf-> Chris
  • isSiblingOf
    Comment: A super-property of isBrotherOf and isSisterOf. OWL engines automatically infer super-properties.
    Example: Chris -isSiblingOf-> Meg
  • isNephewOf
    Comment: A property that infers nephew relationships towards aunts and uncles, based on the child's gender.
    Axiom: isSiblingOf(?x, ?y), hasChild(?x, ?z), hasGender(?z, Male), notEqual(?y, ?x) -> isNephewOf(?z, ?y)
    Example: Stewie -isNephewOf-> Carol
  • isNieceOf
    Comment: A property that infers niece relationships towards aunts and uncles, based on the child's gender.
    Axiom: isSiblingOf(?x, ?y), hasChild(?x, ?z), hasGender(?z, Female), notEqual(?y, ?x) -> isNieceOf(?z, ?y)
    Example: Meg -isNieceOf-> Carol

These axioms are imported into a triple store, whose engine applies them to the explicit facts in real time. Through these axioms, triple stores allow the querying of inferred (hidden) triples. Therefore, if we want to get the explicit information about Chris Griffin, the following query can be executed:

SELECT ?p ?o WHERE {
 <http://example.org/ChrisGriffin> ?p ?o EXPLICIT true
}

If we need to get the inferred values for Chris, the SPARQL engine will provide us with 10 inferred facts:

SELECT ?p ?o WHERE {
 <http://example.org/ChrisGriffin> ?p ?o EXPLICIT false
}

This query will return all implicit facts for Chris Griffin. The image below shows the discovered facts. These are not explicitly stored in the triple store.

These results could not be produced by the property graph store, as no reasoning could be applied automatically. 
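
For readers who want to experiment with this kind of inference locally, here is a rough sketch (not the triple store setup used above) based on the rdflib and owlrl Python libraries, assuming both are installed; it reproduces only the isAncestorOf behaviour from the table:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS
from owlrl import DeductiveClosure, OWLRL_Semantics

EX = Namespace("http://example.org/")
g = Graph()

# Schema: hasChild implies isAncestorOf, and isAncestorOf is transitive.
g.add((EX.isAncestorOf, RDF.type, OWL.TransitiveProperty))
g.add((EX.hasChild, RDFS.subPropertyOf, EX.isAncestorOf))

# Instance data: Carter hasChild Lois, Lois hasChild Chris.
g.add((EX.Carter, EX.hasChild, EX.Lois))
g.add((EX.Lois, EX.hasChild, EX.Chris))

# Materialise the OWL RL inferences into the graph.
DeductiveClosure(OWLRL_Semantics).expand(g)

print((EX.Carter, EX.isAncestorOf, EX.Chris) in g)  # True: inferred, not asserted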

The RDF data model empowers users to discover previously unknown facts, a capability that the LPG data model lacks. Nevertheless, LPG implementations can bypass this limitation by developing complex stored procedures. However, unlike in RDF, these stored procedures may have variations (if at all possible) across different vendor implementations, rendering them non-portable and impractical.

Take-home message

In this article, the RDF and LPG data models have been presented objectively. On the one hand, the LPG data model offers a rapid deployment of graph databases without the need for an advanced schema to be defined (i.e. it is schema-less). Conversely, the RDF data model requires a more time-consuming bootstrapping process for graph data, or a knowledge graph, due to its schema definition requirement. However, the decision to adopt one model over the other should consider whether the additional effort is justified in providing meaningful context to the data. This consideration is influenced by specific use cases. For instance, in social networks where neighbourhood exploration is a primary requirement, the LPG data model may be more suitable. On the other hand, for more advanced knowledge graphs that necessitate reasoning or data integration across multiple sources, the RDF data model is the preferred choice.

It is crucial to avoid letting personal preferences for query languages dictate the choice of data model. Regrettably, many articles available primarily serve as marketing tools rather than educational resources, hindering adoption and creating confusion within the graph database community. Furthermore, in the era of abundant and accessible information, it would be better for vendors to refrain from promoting misinformation about opposing data models. A general misconception promoted by property graph evangelists is that the RDF data model is overly complex and academic, leading to its dismissal. This assertion is based on a preferential prejudice. RDF is both a machine- and human-readable data model that is close to business language, especially through the definition of schemas and ontologies. Moreover, the adoption of the RDF data model is widespread. For instance, Google uses the RDF data model as their standard to represent meta-information about web pages using schema.org. There is also the assumption that the RDF data model will exclusively function with a schema. This is also a misconception; after all, data defined using the RDF data model can also be schema-less. However, it is acknowledged that all semantics would then be lost, and the data would be reduced to plain graph data. This article also mentioned how the OneGraph vision aims to establish a bridge between the two data models.

To conclude, technical feasibility alone should not drive the decision of which graph data model to select. Reducing higher-level abstractions to primitive constructs often increases complexity and can impede solving specific use cases effectively. Decisions should be guided by use case requirements and performance considerations rather than merely by what is technically possible.


The author would like to thank Matteo Casu for his input and review. This article is dedicated to Norm Friend, whose untimely demise left a void in the Knowledge Graph community.


1 Schemas and ontologies are used interchangeably in this article.
2 Lassila, O. et al. The OneGraph Vision: Challenges of Breaking the Graph Model Lock—In. https://www.semantic-web-journal.net/system/files/swj3273.pdf.
3 Broekema, W. et al. openCypher Queries over Combined RDF and LPG Data in Amazon Neptune. https://ceur-ws.org/Vol-3828/paper44.pdf.
4 https://www.w3.org/DesignIssues/LinkedData.html
5 https://www.go-fair.org/fair-principles

The post Let’s Call a Spade a Spade: RDF and LPG — Cousins Who Should Learn to Live Together appeared first on Towards Data Science.

]]>
Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data https://towardsdatascience.com/are-we-watching-more-ads-than-content-analyzing-youtube-sponsor-data/ Fri, 04 Apr 2025 00:16:48 +0000 https://towardsdatascience.com/?p=605408 Exploring if sponsor segments are getting longer by the year

The post Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data appeared first on Towards Data Science.

]]>
I’m definitely not the only person who feels that YouTube sponsor segments have become longer and more frequent recently. Sometimes, I watch videos that seem to be trying to sell me something every couple of seconds.

On one hand, it’s great that both small and medium-sized YouTubers are able to make a living from their craft, but on the other hand, it sure is annoying to be bombarded by ads. 

In this blog post, I will explore these sponsor segments, using data from a popular browser extension called SponsorBlock, to figure out if the perceived increase in ads actually did happen and also to quantify how many ads I’m watching.

I will walk you through my analysis, providing code snippets in SQL, DuckDB, and pandas. All the code is available on my GitHub, and since the dataset is open, I will also teach you how to download it, so that you can follow along and play with the data yourself.

These are the questions I will be trying to answer in this analysis:

  • Have sponsor segments increased over the years?
  • Which channels have the highest percentage of sponsor time per video?
  • What is the density of sponsor segments throughout a video?

To get to these answers, we will have to cover a lot of ground.

Let’s get this started!

How SponsorBlock Works

SponsorBlock is an extension that allows you to skip ad segments in videos, similar to how you skip Netflix intros. It’s incredibly accurate, as I don’t remember seeing one wrong segment since I started using it around a month ago, and I watch a lot of smaller non-English creators.

You might be asking yourself how the extension knows which parts of the video are sponsors, and, believe it or not, the answer is through crowdsourcing!

Users submit the timestamps for the ad segments, and other users vote if it’s accurate or not. For the average user, who isn’t contributing at all, the only thing you have to do is to press Enter to skip the ad.

Okay, now that you know what SponsorBlock is, let’s talk about the data. 

Cleaning the Data

If you want to follow along, you can download a copy of the data using this SponsorBlock Mirror (it might take you quite a few minutes to download it all). The database schema can be seen here, although most of it won’t be useful for this project.

As one might expect, their database schema is made for the extension to work properly, and not for some guy to basically leech from a huge community effort to find what percentage of ads his favorite creator runs. For this, some work will need to be done to clean and model the data.

The only two tables that are important for this analysis are:

  • sponsorTimes.csv : This is the most important table, containing the startTime and endTime of all crowdsourced sponsor segments. The CSV is around 5GB.
  • videoInfo.csv : Contains the video title, publication date, and channel ID associated with each video.

Before we get into it, these are all the libraries I ended up using. I will explain the less obvious ones as we go.

pandas
duckdb
requests
requests-cache
python-dotenv
seaborn
matplotlib
numpy

The first step, then, is to load the data. Surprisingly, this was already a bit challenging, as I was getting a lot of errors parsing some rows of the CSV. These were the settings I found to work for the majority of the rows:

import duckdb
import os

# Connect to an in-memory DuckDB instance
con = duckdb.connect(database=':memory:')

sponsor_times = con.read_csv(
    "sb-mirror/sponsorTimes.csv",
    header=True,
    columns={
        "videoID": "VARCHAR",
        "startTime": "DOUBLE",
        "endTime": "DOUBLE",
        "votes": "INTEGER",
        "locked": "INTEGER",
        "incorrectVotes": "INTEGER",
        "UUID": "VARCHAR",
        "userID": "VARCHAR",
        "timeSubmitted": "DOUBLE",
        "views": "INTEGER",
        "category": "VARCHAR",
        "actionType": "VARCHAR",
        "service": "VARCHAR",
        "videoDuration": "DOUBLE",
        "hidden": "INTEGER",
        "reputation": "DOUBLE",
        "shadowHidden": "INTEGER",
        "hashedVideoID": "VARCHAR",
        "userAgent": "VARCHAR",
        "description": "VARCHAR",
    },
    ignore_errors=True,
    quotechar="",
)

video_info = con.read_csv(
    "sb-mirror/videoInfo.csv",
    header=True,
    columns={
        "videoID": "VARCHAR",
        "channelID": "VARCHAR",
        "title": "VARCHAR",
        "published": "DOUBLE",
    },
    ignore_errors=True,
    quotechar=None,
)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Here is what a sample of the data looks like:

con.sql("SELECT videoID, startTime, endTime, votes, locked, category FROM sponsor_times LIMIT 5")

con.sql("SELECT * FROM video_info LIMIT 5")
Sample of sponsorTimes.csv
Sample of videoInfo.csv

Understanding the data in the sponsorTimes table is ridiculously important; otherwise, the cleaning process won’t make any sense.

Each row represents a user-submitted timestamp for a sponsored segment. Since multiple users can submit segments for the same video, the dataset contains duplicate and potentially incorrect entries, which will need to be dealt with during cleaning.

To find incorrect segments, I will use the votes and the locked column, as the latter one represents segments that were confirmed to be correct. 

Another important column is the category. There are a bunch of categories like Intro, Outro, Filler, etc. For this analysis, I will only work with Sponsor and Self-Promo.

I started by applying some filters:

CREATE TABLE filtered AS
SELECT
    *
FROM sponsor_times
WHERE category IN ('sponsor', 'selfpromo') AND (votes > 0 OR locked=1)

Filtering for locked segments or segments with more than 0 votes was a big decision. This reduced the dataset by a huge percentage, but doing so made the data very reliable. For example, before doing this, all of the Top 50 channels with the highest percentage of ads were just spam, random channels that ran 99.9% of ads.

With this done, the next step is to get a dataset where each sponsor segment shows up only once. For example, a video with a sponsor segment at the beginning and another at the end should have only two rows of data.

This is very much not the case so far, since in one video we can have multiple user-submitted entries for each segment. To do this, I will use window functions to identify if two or more rows of data represent the same segment. 

The first window function compares the startTime of one row with the endTime of the previous. If these values don’t overlap, it means they are entries for separate segments, otherwise they are repeated entries for the same segment. 

CREATE TABLE new_segments AS
SELECT
    -- Coalesce to TRUE to deal with the first row of every window
    -- as the values are NULL, but it should count as a new segment.
    COALESCE(startTime > LAG(endTime) 
      OVER (PARTITION BY videoID ORDER BY startTime), true) 
      AS new_ad_segment,
    *
FROM filtered
Window Function example for a single video.

The new_ad_segment column is TRUE every time a row represents a new segment of a video. The first two rows, as their timestamps overlap, are properly marked as the same segment.

Next up, the second window function will label each ad segment by number:

CREATE TABLE ad_segments AS
SELECT
    SUM(new_ad_segment) 
      OVER (PARTITION BY videoID ORDER BY startTime)
      AS ad_segment,
    *
FROM new_segments
Example of labels for ad segments for a single video.

Finally, now that each segment is properly numbered, it’s easy to get the segment that is either locked or has the highest amount of votes.

CREATE TABLE unique_segments AS
SELECT DISTINCT ON (videoID, ad_segment)
    *
FROM ad_segments
ORDER BY videoID, ad_segment, locked DESC, votes DESC
Example of what the final dataset looks like for a single video.

That’s it! Now this table has one row for each unique ad segment, and I can start exploring the data.

If these queries feel complicated, and you need a refresher on window functions, check out this blog post that will teach you all you need to know about them! The last example covered in the blog post is almost exactly the process I used here.

Exploring and Enhancing the Data

Finally, the dataset is good enough to start exploring. The first thing I did was to get a sense of the size of the data:

  • 36.0k Unique Channels
  • 552.6k Unique Videos
  • 673.8k Unique Sponsor Segments, for an average of 1.22 segments per video

As mentioned earlier, filtering for segments that were either locked or had at least one upvote reduced the dataset massively, by around 80%. But this is the price I had to pay to have data that I could work with.
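
For reference, the headline numbers above can be reproduced from the cleaned table with a query along these lines (a sketch, not necessarily the exact query I ran; the channel count additionally needs the join with videoInfo):

con.sql("""
    SELECT
        COUNT(DISTINCT videoID)                  AS unique_videos,
        COUNT(*)                                 AS unique_segments,
        COUNT(*) * 1.0 / COUNT(DISTINCT videoID) AS segments_per_video
    FROM unique_segments
""").show()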

To check if there is nothing immediately wrong with the data, I gathered the channels that have the most amount of videos:

CREATE TABLE top_5_channels AS 
SELECT
    channelID,
    count(DISTINCT unique_segments.videoID) AS video_count
FROM
    unique_segments
    LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID 
WHERE
    channelID IS NOT NULL
    -- Some channel IDs are blank
    AND channelID != '""'
GROUP BY
    channelID
ORDER BY
    video_count DESC
LIMIT 5

The amount of videos per channel looks realistic… But this is terrible to work with. I don’t want to go to my browser and look up channel IDs every time I want to know the name of a channel.

To fix this, I created a small script with functions to get these values from the YouTube API in Python. I’m using the library requests_cache to make sure I won’t be repeating API calls and depleting the API limits.

import requests
import requests_cache
from dotenv import load_dotenv
import os

load_dotenv()
API_KEY = os.getenv("YT_API_KEY")

# Cache responses indefinitely
requests_cache.install_cache("youtube_cache", expire_after=None)

def get_channel_name(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    response = requests.get(url)
    data = response.json()

    try:
        return data.get("items", [])[0].get("snippet", {}).get("title", "")
    except (IndexError, AttributeError):
        return ""

Besides this, I also created very similar functions to get the country and thumbnail of each channel, which will be useful later. If you’re interested in the code, check the GitHub repo.
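
For example, the country helper follows the same pattern; a sketch of it might look like this (the real version lives in the repo, and snippet.country is the field exposed by the YouTube Data API v3):

def get_channel_country(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    response = requests.get(url)
    data = response.json()

    try:
        # "country" is optional in the snippet, so fall back to an empty string
        return data.get("items", [])[0].get("snippet", {}).get("country", "")
    except (IndexError, AttributeError):
        return ""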

In my DuckDB code, I’m now able to register these Python functions and call them within SQL! I just need to be very careful to always use them on aggregated and filtered data, otherwise, I can say bye-bye to my API quota.

# This the script created above
from youtube_api import get_channel_name

# Try registering the function, ignore if already exists
try:
    con.create_function('get_channel_name', get_channel_name, [str], str)
except Exception as e:
    print(f"Skipping function registration (possibly already exists): {e}")

# Get the channel names
channel_names = con.sql("""
    select
        channelID,
        get_channel_name(channelID) as channel_name,
        video_count
    from top_5_channels
""")

Much better! I looked up two channels that I’m familiar with on YouTube for a quick sanity check. Linus Tech Tips has a total of 7.2k videos uploaded, with 2.3k present in this dataset. Gamers Nexus has 3k videos, with 700 in the dataset. Looks good enough for me!

The last thing to do, before moving over to actually answering the question I set myself to answer, is to have an idea of the average duration of videos. 

This matches my expectations, for the most part. I’m still a bit surprised by the amount of 20-40-minute videos, as for many years the “meta” was to have videos of 10 minutes to maximize YouTube’s own ads. 

Also, I thought those buckets of video durations used in the previous graph were quite representative of how I think about video lengths, so I will be sticking with them for the next sections.

For reference, this is the pandas code used to create those buckets.

video_lengths = con.sql("""
  SELECT DISTINCT ON (videoID)
      videoID,
      videoDuration
  FROM
      unique_segments
  WHERE
      videoID IS NOT NULL
      AND videoDuration > 0
"""
).df()

# Define custom bins, in minutes
bins = [0, 3, 7, 12, 20, 40, 90, 180, 600, 9999999] 
labels = ["0-3", "3-7", "7-12", "12-20", "20-40", "40-90", "90-180", "180-600", "600+"]

# Assign each video to a bucket (transform duration to minutes)
video_lengths["duration_bucket"] = pd.cut(video_lengths["videoDuration"] / 60, bins=bins, labels=labels, right=False)

Have Sponsor Segments Increased Over the Years?

The big question. This will prove if I’m being paranoid or not about everyone trying to sell me something at all times. I will start, though, by answering a simpler question, which is the percentage of sponsors for different video durations.

My expectation is that shorter videos have a higher share of their runtime from sponsors in comparison to longer videos. Let’s check if this is actually the case.

CREATE TABLE video_total_ads AS
SELECT
    videoID,
    MAX(videoDuration) AS videoDuration,
    SUM(endTime - startTime) AS total_ad_duration,
    SUM(endTime - startTime) / 60 AS ad_minutes,
    SUM(endTime - startTime) / MAX(videoDuration) AS ad_percentage,
    MAX(videoDuration) / 60 AS video_duration_minutes
FROM
    unique_segments
WHERE
    videoDuration > 0
    AND videoDuration < 5400
    AND videoID IS NOT NULL
GROUP BY
    videoID

To keep the visualization simple, I’m applying similar buckets, but only up to 90 minutes.

# Define duration buckets (in minutes, up to 90min)
bins = [0, 3, 7, 12, 20, 30, 40, 60, 90]    
labels = ["0-3", "3-7", "7-12", "12-20", "20-30", "30-40", "40-60", "60-90"]

video_total_ads = video_total_ads.df()

# Apply the buckets again
video_total_ads["duration_bucket"] = pd.cut(video_total_ads["videoDuration"] / 60, bins=bins, labels=labels, right=False)

# Group by bucket and sum ad times and total durations
bucket_data = video_total_ads.groupby("duration_bucket")[["ad_minutes", "videoDuration"]].sum()

# Convert to percentage of total video time
bucket_data["ad_percentage"] = (bucket_data["ad_minutes"] / (bucket_data["videoDuration"] / 60)) * 100
bucket_data["video_percentage"] = 100 - bucket_data["ad_percentage"]

As expected, if you’re watching shorter-form content on YouTube, then around 10% of it is sponsored! Videos of 12–20 min in duration have 6.5% of their runtime sponsored, while 20–30 min videos have only 4.8%.

To move forward to the year-by-year analysis I need to join the sponsor times with the videoInfo table.

CREATE TABLE video_total_ads_joined AS
SELECT
    *
FROM
    video_total_ads
LEFT JOIN video_info ON video_total_ads.videoID = video_info.videoID

Next, let’s just check how many videos we have per year:

SELECT
    *,
    to_timestamp(NULLIF (published, 0)) AS published_date,
    extract(year FROM to_timestamp(NULLIF (published, 0))) AS published_year
FROM
    video_total_ads_joined

Not good, not good at all. I’m not exactly sure why but there are a lot of videos that didn’t have the timestamp recorded. It seems that only in 2021 and 2022 videos were reliably stored with their published date.

I do have some ideas on how I can improve this dataset with other public data, but it’s a very time-consuming process and I will leave this for a future blog post. I don’t intend to settle for an answer based on limited data, but for now, I will have to make do with what I have.

I chose to keep the analysis between the years 2018 and 2023, given that those years had more data points.

# Limiting the years as for these here I have a decent amount of data.
start_year = 2018
end_year = 2023

plot_df = (
    video_total_ads_joined.df()
    .query(f"published_year >= {start_year} and published_year <= {end_year}")
    .groupby(["published_year", "duration_bucket"], as_index=False)
    [["ad_minutes", "video_duration_minutes"]]
    .sum()
)

# Calculate ad_percentage & content_percentage
plot_df["ad_percentage"] = (
    plot_df["ad_minutes"] / plot_df["video_duration_minutes"] * 100
)
plot_df["content_percentage"] = 100 - plot_df["ad_percentage"]

There is a steep increase in ad percentage, especially from 2020 to 2021, but afterward, it plateaus, especially for longer videos. This makes a lot of sense since during those years online advertisement grew a lot as people spent more and more time at home. 

For shorter videos, there does seem to be an increase from 2022 to 2023. But as the data is limited, and I don’t have data for 2024, I can’t get a conclusive answer to this. 

Next up, let’s move into questions that don’t depend on the publishing date, this way I can work with a larger portion of the dataset.

Which Channels Have the Highest Percentage of Sponsor Time Per Video?

This is a fun one for me, as I wonder if the channels I actively watch are the ones that run the most ads. 

Continuing from the table created previously, I can easily group the ad and video amount by channel:

CREATE TABLE ad_percentage_per_channel AS
SELECT
    channelID,
    sum(ad_minutes) AS channel_total_ad_minutes,
    sum(videoDuration) / 60 AS channel_total_video_minutes,
    sum(ad_minutes) / (sum(videoDuration) / 60) AS channel_ad_percentage
FROM
    video_total_ads_joined
GROUP BY
    channelID

I decided to filter for channels that had at least 1,800 minutes (30 hours) of video in the data, as a way of eliminating outliers.

SELECT
    channelID,
    channel_total_video_minutes,
    channel_total_ad_minutes,
    channel_ad_percentage
FROM
    ad_percentage_per_channel
WHERE
    -- At least 1,800 minutes (30 hours) of video
    channel_total_video_minutes > 1800
    AND channelID IS NOT NULL
ORDER BY
    channel_ad_percentage DESC
LIMIT 50

As briefly mentioned earlier, I also created some functions to get the country and thumbnail of each channel. This allowed me to create this visualization.

I’m not sure if this surprised me or not. Some of the channels on this list I watch very frequently, especially Gaveta (#31), a Brazilian YouTuber who covers movies and film editing.

I also know that both he and Corridor Crew (#32) do a lot of self-sponsor, promoting their own content and products, so maybe this is also the case for other channels! 

In any case, the data seems good, and the percentages seem to match my manual checks and personal experience.

I would love to know if channels that you watch were present in this list, and if it surprised you or not!

If you want to see the Top 150 Creators, subscribe to my free newsletter, as I will be publishing the full list as well as more information about this analysis in there!

What Is the Density of Sponsor Segments Throughout a Video?

Have you ever thought about at which point of the video ads work best? People probably just skip sponsor segments placed at the beginning, and just move on and close the video for those placed at the end.

From personal experience, I feel that I’m more likely to watch an ad if it plays around the middle of a video, but I don’t think this is what creators do in most cases.

My goal, then, is to create a heatmap that shows the density of ads across a video's runtime. Doing this was surprisingly tricky, and the solution I found was so clever that it kind of blew my mind. Let me show you.

This is the data needed for this analysis. One row per ad, with the timestamp when each segment starts and ends:

The first step is to normalize the intervals, e.g., I don’t care that an ad started at 63s, what I want to know is if it started at 1% of the video runtime or 50% of the video runtime.

CREATE TABLE ad_intervals AS
SELECT
    videoID,
    startTime,
    endTime,
    videoDuration,
    startTime / videoDuration AS start_fraction,
    endTime / videoDuration AS end_fraction
FROM
    unique_segments
WHERE
    -- Just to make sure we don't have bad data
    videoID IS NOT NULL
    AND startTime >= 0
    AND endTime <= videoDuration
    AND startTime < endTime
    -- Less than 40h
    AND videoDuration < 144000

Great, now all intervals are comparable, but the problem is far from solved.

I want you to think about how you would solve this. If I asked you, “At 10% of the runtime, across all videos, how many ads are running?”, what would you do?

I do not believe that this is an obvious problem to solve. My first instinct was to create a bunch of buckets, and then, for each row, I would ask “Is there an ad running at 1% of the runtime? What about at 2%? And so on…”

This seemed like a terrible idea, though. I wouldn’t be able to do it in SQL, and the code to solve it would be incredibly messy. In the end, the implementation of the solution I found was remarkably simple, using the Sweep Line Algorithm, an algorithm that is often used in programming interviews and puzzles.

I will show you how I solved it but don’t worry if you don’t understand what is happening. I will share other resources for you to learn more about it later on.

The first thing to do is to transform each interval (startTime, endTime) into two events, one that will count as +1 when the ad starts, and another that will count as -1 when the ad finishes. Afterward, just order the dataset by the “start time”.

CREATE TABLE ad_events AS
WITH unioned as (
  -- This is the most important step.
  SELECT
      videoID,
      start_fraction as fraction,
      1 as delta
  FROM ad_intervals
  UNION ALL
  SELECT
      videoID,
      end_fraction as fraction,
      -1 as delta
  FROM ad_intervals
), ordered AS (
  SELECT
      videoID,
      fraction,
      delta
  FROM unioned
  ORDER BY fraction, delta
)
SELECT * FROM ordered

Now it’s already much easier to see the path forward! All I have to do is use a running sum on the delta column, and then, at any point of the dataset, I can know how many ads are running! 

For example, if from 0s to 10s three ads started, but two of those also finished, I would have a delta of +3 and then -2, which means that there is only one ad currently running!
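
To make the running-sum idea concrete, here is a tiny Python sketch on made-up intervals (not the real dataset): each ad becomes a +1 event at its start fraction and a -1 event at its end fraction, and a running sum over the sorted events tells us how many ads are active at any point of the runtime.

# Toy sweep line over normalized (start_fraction, end_fraction) ad intervals
intervals = [(0.00, 0.05), (0.02, 0.10), (0.04, 0.08)]

# Each interval becomes two events: +1 when the ad starts, -1 when it ends
events = [(start, +1) for start, _ in intervals] + [(end, -1) for _, end in intervals]
events.sort()  # order by fraction, then delta (-1 before +1), just like the SQL ORDER BY

running = 0
for fraction, delta in events:
    running += delta
    print(f"at {fraction:.0%} of the runtime -> {running} ad(s) running")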

Going forward, and to simplify the data a bit, I first round the fractions to 4 decimal places and aggregate them. This is not necessary, but having too many rows was a problem when trying to plot the data. Finally, I divide the number of running ads by the total number of videos, to have it as a percentage.

CREATE TABLE ad_counter AS 
WITH rounded_and_grouped AS (
  SELECT
      ROUND(fraction, 4) as fraction,
      SUM(delta) as delta
  FROM ad_events
  GROUP BY ROUND(fraction, 4)
  ORDER BY fraction
), running_sum AS (
  SELECT
      fraction,
      SUM(delta) OVER (ORDER BY fraction) as ad_counter
  FROM rounded_and_grouped
), density AS (
  SELECT
      fraction,
      ad_counter,
      ad_counter / (SELECT COUNT(DISTINCT videoID) FROM unique_segments_filtered) as density
  FROM running_sum
)
SELECT * FROM density

With this data, I know not only that there are 69,987 videos running ads right at the start (0.0% fraction), but also that this represents 17% of all videos in the dataset.

Now I can finally plot it as a heatmap:

As expected, the bumps at the extremities show that it’s way more common for channels to run ads at the beginning and end of the video. It’s also interesting that there is a plateau around the middle of the video, but then a drop, as the second half of the video is generally more ad-free.

What I found funny is that it’s apparently common for some videos to start straight away with an ad. I couldn’t picture this, so I manually checked 10 videos and it’s actually true… I’m not sure how representative it is, but most of the ones that I opened were gaming-related and in Russian, and they started directly with ads!

Before we move on to the conclusions, what did you think of the solution to this problem? I was surprised by how simple it became with the Sweep Line trick. If you want to know more about it, I recently published a blog post covering some SQL Patterns, and the last one is exactly this problem! Just repackaged in the context of counting concurrent meetings.

Conclusion

I really enjoyed doing this analysis since the data feels very personal to me, especially because I’ve been addicted to YouTube lately. I also feel that the answers I found were quite satisfactory, at least for the most part. To finish it off, let’s do a last recap!

Have Sponsor Segments Increased Over the Years?

There was a clear increase from 2020 to 2021. This was an effect that happened throughout all digital media and it’s clearly shown in this data. In more recent years, I can’t say whether there was an increase or not, as I don’t have enough data to be confident. 

Which Channels Have the Highest Percentage of Sponsor Time Per Video?

I got to create a very convincing list of the Top 50 channels that run the highest amount of ads. And I discovered that some of my favorite creators are the ones that spend the most time trying to sell me something!

What is the density of sponsor segments throughout a video?

As expected, most people run ads at the beginning and the end of videos. Besides this, a lot of creators run ads around the middle of the video, making the second half slightly more ad-free. 

Also, there are YouTubers who immediately start a video with ads, which I think is a crazy strategy. 

Other Learnings and Next Steps

I liked how clearly the data showed the percentage of ads across different video sizes. Now I know that I’m probably spending 5–6% of my time on YouTube watching ads (if I’m not skipping them), since I mostly watch videos that are 10–20 minutes long.

I’m still not fully happy with the year-by-year analysis, though. I’ve already looked into other data and downloaded more than 100 GB of YouTube metadata datasets. I’m confident that I can use it, together with the YouTube API, to fill some gaps and get a more convincing answer to my question.

Visualization Code

You might have noticed that I didn’t provide snippets to plot the charts shown here. This was on purpose to make the blog post more readable, as matplotlib code occupies a lot of space.

You can find all the code in my GitHub repo, so you can copy my charts if you want to.


That’s it for this one! I really hope you enjoyed reading this blog post and learned something new!

If you’re curious about interesting topics that didn’t make it into this post, or enjoy learning about data, subscribe to my free newsletter on Substack. I publish whenever I have something genuinely interesting to share.

Want to connect directly or have questions? Reach out anytime at mtrentz.com.

All images and animations by the author unless stated otherwise.

The post Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data appeared first on Towards Data Science.

]]>
Kernel Case Study: Flash Attention https://towardsdatascience.com/kernel-case-study-flash-attention/ Thu, 03 Apr 2025 18:53:44 +0000 https://towardsdatascience.com/?p=605403 Understanding all versions of flash attention through a triton implementation

The post Kernel Case Study: Flash Attention appeared first on Towards Data Science.

]]>
The attention mechanism is at the core of modern day transformers. But scaling the context window of these transformers was a major challenge, and it still is, even though we are in the era of million-token-plus context windows (Qwen 2.5 [1]). There are both considerable compute- and memory-bound complexities in these models when we scale the context window (a naive attention mechanism scales quadratically in both compute and memory requirements). Revisiting Flash Attention lets us understand the complexities of optimizing the underlying operations on GPUs and, more importantly, gives us a better grip on what’s next.

Let’s quickly revisit a naive attention algorithm to see what’s going on.

Attention Algorithm. Image by Author

As you can see, if we are not careful we end up materializing a full NxM attention matrix in the GPU HBM, meaning the memory requirement grows quadratically with context length.
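
As a baseline, here is a minimal PyTorch sketch of that naive algorithm; note the explicit NxM score matrix that has to live in memory before the softmax.

import torch

def naive_attention(Q, K, V):
    """Naive scaled dot-product attention: materializes the full N x M score matrix."""
    d = Q.shape[-1]
    scores = (Q @ K.transpose(-2, -1)) / d**0.5  # (N, M) -- the quadratic blow-up
    weights = torch.softmax(scores, dim=-1)      # another full N x M matrix
    return weights @ V                           # (N, D)

Q, K, V = (torch.randn(4096, 64) for _ in range(3))
out = naive_attention(Q, K, V)  # correct, but `scores` alone holds 4096 * 4096 floats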

If you want to learn more about the GPU memory hierarchy, my previous post on Triton is a good starting point. It will also come in handy later in this post when we get to implementing the Flash Attention kernel in Triton. The Flash Attention paper also has a really good introduction to this.

Additionally, when we look at the steps involved in executing this algorithm and its pattern of accessing the slow HBM (which, as explained later in the post, can be a major bottleneck as well), we notice a few things:

  1. We have Q, K and V in the HBM initially
  2. We need to access Q and K initially from the HBM to compute the dot product
  3. We write the output scores back to the HBM
  4. We access it again to execute the softmax; optionally, for causal attention, as in the case of LLMs, we have to mask this output before the softmax. The resulting full attention matrix is written again into the HBM
  5. We access the HBM again to execute the final dot product, to get both the attention weights and the Value matrix to write the output back to the slow GPU memory

I think you get the point. We could smartly read and write from the HBM to avoid redundant operations, to make some potential gains. This is exactly the primary motivation for the original Flash Attention algorithm.

Flash Attention initially came out in 2022 [2], then a year later came out with some much needed improvements in 2023 as Flash Attention v2 [3], and again in 2024 with additional improvements for Nvidia Hopper and Blackwell GPUs [4] as Flash Attention v3 [5]. The original Flash Attention paper identified that the attention operation is limited by memory bandwidth rather than compute. (In the past, there have been attempts to reduce the computational complexity of attention from O(N**2) to O(NlogN) and lower through approximate algorithms.)

Flash Attention proposed a fused kernel which does all of the above attention operations in one go, block-wise, to get the final attention output without ever having to materialize the full N**2 attention matrix in memory, making the algorithm significantly faster. The term `fused` simply means we combine multiple operations in the GPU SRAM before making the much slower journey across the slower GPU memory, making the algorithm performant, all the while providing the exact attention output without any approximations.

This lecture, from Stanford CS139, brilliantly demonstrates the impact a well-thought-out memory access pattern can have on an algorithm. I highly recommend you check it out if you haven’t already.

Before we start diving into flash attention (it’s getting tedious to type this over and over so let’s agree to call it FA, shall we?) in triton there is something else that I wanted to get out of the way.

Numerical Stability in exponents

Let’s take the example of FP32 numbers. float32 (standard 32-bit float) uses 1 sign bit, 8 exponent bits, and 23 mantissa bits [6], so its largest finite values are on the order of 2^127 ≈ 1.7×10^38. Looking at exponentials, e^88 ≈ 1.65×10^38, which means that exponents anywhere near 88 (in practice we would want to stay well below that to be safe) put us in trouble, as we could easily overflow. Here’s a very interesting chat with OpenAI o1 shared by folks at AllenAI in their OpenInstruct repo. Although it talks about stabilizing KL divergence calculations in the setting of RLHF/RL, the ideas translate exactly to exponentials as well. So, to deal with the softmax situation in attention, what we do is the following:

Softmax with rescaling. Image by Author

TRICK : Let’s also observe the following, if you do this:

Rescaling Trick. Image by Author

then you can rescale/readjust values without affecting the final softmax value. This is really useful when you have an initial estimate for the maximum value, but that might change when we encounter a new set of values. I know I know, stay with me and let me explain.
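
Here is a tiny NumPy sanity check of both ideas on toy values: subtracting the running max keeps the exponentials finite, and the identity exp(x - m_old) * exp(m_old - m_new) = exp(x - m_new) lets us correct earlier partial results once a larger max shows up.

import numpy as np

# 1) Stability: exp(89) already overflows float32, but exp(x - max(x)) never does
x = np.array([10.0, 50.0, 89.0], dtype=np.float32)
m = x.max()
stable_softmax = np.exp(x - m) / np.exp(x - m).sum()  # same result as the naive softmax, no overflow

# 2) Rescaling: partial exponentials computed against an old max estimate
#    can be corrected once a larger max arrives, without touching x again
y = np.array([3.0, 5.0, 9.0])
m_old = y[:2].max()                  # max of the values seen so far
partial = np.exp(y[:2] - m_old)
m_new = max(m_old, y[2])             # a bigger max arrives with the next block
corrected = partial * np.exp(m_old - m_new)
assert np.allclose(corrected, np.exp(y[:2] - m_new))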

Setting the scene

Let’s take a small detour into matrix multiplication.

Blocked Matrix Multiplication. Image by Author

This shows a toy example of a blocked matrix multiplication except we have blocks only on the rows of A (green) and columns of B (Orange? Beige?). As you can see above the output O1, O2, O3 and O4 are complete (those positions need no more calculations). We just need to fill in the remaining columns in the initial rows by using the remaining columns of B. Like below:

Next set of block fill the remaining spaces up. Image by Author

So we can fill these places in the output with a block of columns from B and a block of rows from A at a time.

Connecting the dots

When I introduced FA, I said that we never have to compute the full attention matrix and store the whole thing. So here’s what we do:

  1. Compute a block of the attention matrix using a block of rows from Q and a block of columns from K. Once you get the partial attention matrix compute a few statistics and keep it in the memory.
Computing block attention scores S_b, and computing the row-wise maximums. Image by Author

I have greyed O5 to O12 because we don’t know those values yet, as they need to come from the subsequent blocks. We then transform Sb like below:

Keeping a track of the current row-sum and row-maxes. Image by Author
Exponents with the scaling trick. Image by Author

Now you have the setup for a partial softmax

Partial Softmax, as the denominator is still a partial sum. Image by Author

But:

  1. What if the true maximum is in the Oi’s that are yet to come?
  2. The sum is still local, so we need to update this every time we see new Pi’s. We know how to keep track of a sum, but what about rebasing it to the true maximum?

Recall the trick above. All that we have to do is to keep a track of the maximum values we encounter for each row, and iteratively update as you see new maximums from the remaining blocks of columns from K for the same set of rows from Q.

Two consecutive blocks and its row max manipulations. Image by Author
Updating the estimate of our current sum with rescaling

We still do not want to write our partial softmax matrix into HBM. We keep it for the next step.

The final dot product

The last step in our attention computation is the dot product with V. To start, we would have initialized a matrix full of 0s in our HBM as our output, of shape NxD, where N is the number of queries as above. We use the same block size for V as we had for K, except we apply it row-wise, like below (the subscripts just denote that this is only a block and not the full matrix).

A single block of attention scores creating a partial output. Image by Author
Whereas the full output would require the sum of all these dot products. Some of which will be filled in by the blocks to come. Image by Author

Notice how we need the attention scores from all the blocks to get the final product. But if we calculate the local score and `accumulate` it, just like we did to get the actual Ls, we can form the full output once we have processed all the blocks of columns (Kb) for a given row block (Qb).

Putting it all together

Let’s put all these ideas together to form the final algorithm

Flash Attention V1 Algorithm. Source: Tri Dao et.al [2]

To understand the notation, _ij implies that it is the local values for a given block of columns and rows and _i implies it’s for the global output rows and Query blocks. The only part we haven’t explained so far is the final update to Oi. That’s where we use all the ideas from above to get the right scaling.
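
Before we get to the Triton code, here is a minimal, unoptimized pure-PyTorch sketch of this blockwise recipe. It loops in Python and is not IO-aware, so it is only meant to make the running max/sum bookkeeping concrete, not to be a faithful reimplementation of the kernel.

import torch

def blockwise_attention(Q, K, V, Br=64, Bc=64):
    """Reference (slow) blockwise attention with running row maxes and sums."""
    N, D = Q.shape
    M = K.shape[0]
    O = torch.zeros(N, D)
    l = torch.zeros(N, 1)                    # running softmax denominators
    m = torch.full((N, 1), float("-inf"))    # running row maxes

    for j in range(0, M, Bc):                # outer loop over K/V column blocks
        Kb, Vb = K[j:j + Bc], V[j:j + Bc]
        for i in range(0, N, Br):            # inner loop over Q row blocks
            Qb = Q[i:i + Br]
            S = Qb @ Kb.T                    # (Br, Bc) score block, never the full N x M
            m_ij = S.max(dim=1, keepdim=True).values
            P = torch.exp(S - m_ij)
            l_ij = P.sum(dim=1, keepdim=True)

            m_new = torch.maximum(m[i:i + Br], m_ij)
            l_new = torch.exp(m[i:i + Br] - m_new) * l[i:i + Br] + torch.exp(m_ij - m_new) * l_ij
            O[i:i + Br] = (l[i:i + Br] * torch.exp(m[i:i + Br] - m_new) * O[i:i + Br]
                           + torch.exp(m_ij - m_new) * (P @ Vb)) / l_new
            m[i:i + Br], l[i:i + Br] = m_new, l_new
    return O

Q, K, V = (torch.randn(256, 32) for _ in range(3))
assert torch.allclose(blockwise_attention(Q, K, V),
                      torch.softmax(Q @ K.T, dim=-1) @ V, atol=1e-4)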

The whole code is available as a gist here.

Let’s see what these initializations look like in torch:

def flash_attn_v1(Q, K, V, Br, Bc):
  """Flash Attention V1"""
  B, N, D = Q.shape
  M = K.shape[1]
  Nr = int(np.ceil(N/Br))  # number of query (row) blocks
  Nc = int(np.ceil(M/Bc))  # number of key/value (column) blocks
  
  Q = Q.to('cuda')
  K = K.to('cuda')
  V = V.to('cuda')
  
  batch_stride = Q.stride(0)
  
  O = torch.zeros_like(Q).to('cuda')
  lis = torch.zeros((B, Nr, int(Br)), dtype=torch.float32).to('cuda')
  mis = torch.ones((B, Nr, int(Br)), dtype=torch.float32).to('cuda')*-torch.inf
  
  grid = (B, )
  flash_attn_v1_kernel[grid](
      Q, K, V,
      N, M, D,
      Br, Bc,
      Nr, Nc,
      batch_stride,
      Q.stride(1),
      K.stride(1),
      V.stride(1),
      lis, mis,
      O,
      O.stride(1),
  )
  return O

If you are unsure about the launch grid, checkout my introduction to Triton

Take a closer look at how we initialized our Ls and Ms. We are keeping one for each row block of Output/Query, each of size Br. There are Nr such blocks in total.

In the example above I was simply using Br = 2 and Bc = 2. But in the above code the initialization is based on the device capacity. I have included the calculation for a T4 GPU. For any other GPU, we need to get the SRAM capacity and adjust these numbers accordingly. Now for the actual kernel implementation:

# Flash Attention V1
import triton
import triton.language as tl
import torch
import numpy as np
import pdb

@triton.jit
def flash_attn_v1_kernel(
    Q, K, V,
    N: tl.constexpr, M: tl.constexpr, D: tl.constexpr,
    Br: tl.constexpr,
    Bc: tl.constexpr,
    Nr: tl.constexpr,
    Nc: tl.constexpr,
    batch_stride: tl.constexpr,
    q_rstride: tl.constexpr,
    k_rstride: tl.constexpr, 
    v_rstride: tl.constexpr,
    lis, mis,
    O,
    o_rstride: tl.constexpr):
    
    """Flash Attention V1 kernel"""
    
    pid = tl.program_id(0)
    

    for j in range(Nc):
        k_offset = ((tl.arange(0, Bc) + j*Bc) * k_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * M * D
        # Using k_rstride and v_rstride as we are looking at the entire row at once, for each k v block 
        v_offset = ((tl.arange(0, Bc) + j*Bc) * v_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * M * D
        k_mask = k_offset < (pid + 1) * M*D
        v_mask = v_offset < (pid + 1) * M*D
        k_load = tl.load(K + k_offset, mask=k_mask, other=0)
        v_load = tl.load(V + v_offset, mask=v_mask, other=0)
        for i in range(Nr):
            q_offset = ((tl.arange(0, Br) + i*Br) * q_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * N * D
            q_mask = q_offset < (pid + 1) * N*D
            q_load = tl.load(Q + q_offset, mask=q_mask, other=0)
            # Compute attention
            s_ij = tl.dot(q_load, tl.trans(k_load))
            m_ij = tl.max(s_ij, axis=1, keep_dims=True)
            p_ij = tl.exp(s_ij - m_ij)
            l_ij = tl.sum(p_ij, axis=1, keep_dims=True)
            
            ml_offset = tl.arange(0, Br) + Br * i + pid * Nr * Br
            m = tl.load(mis + ml_offset)[:, None]
            l = tl.load(lis + ml_offset)[:, None]

            m_new = tl.where(m < m_ij, m_ij, m)

            l_new = tl.exp(m - m_new) * l + tl.exp(m_ij - m_new) * l_ij

            o_ij = tl.dot(p_ij, v_load)

            output_offset = ((tl.arange(0, Br) + i*Br) * o_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * N * D
            output_mask = output_offset < (pid + 1) * N*D
            o_current = tl.load(O + output_offset, mask=output_mask)

            o_new = (1/l_new) * (l * tl.exp(m - m_new) * o_current + tl.exp(m_ij - m_new) * o_ij)

            tl.store(O + output_offset, o_new, mask=output_mask)
            tl.store(mis + ml_offset, tl.reshape(m_new, (Br,)))
            tl.store(lis + ml_offset, tl.reshape(l_new, (Br,)))

Let’s understand what’s happening here:

  1. Create 1 kernel for each NxD matrix in the batch. In reality we would have one more dimension to parallelize across, the head dimension. But for understanding the implementation I think this would suffice.
  2. In each kernel we do the following:
    1. For each block of columns in K and V we load up the relevant part of the matrix (Bc x D) into the GPU SRAM (Current total SRAM usage = 2BcD). This stays in the SRAM till we are done with all the row blocks
    2. For each row block of Q, we load the block onto SRAM as well (Current total SRAM Usage = 2BcD + BrD)
    3. On chip we compute the dot product (sij), compute the local row-maxes (mij), the exp (pij), and the expsum (lij)
    4. We load up the running stats for the ith row block. Two vectors of size Br x 1, which denotes the current global row-maxes (mi) and the expsum (li). (Current SRAM usage: 2BcD + BrD + 2Br)
    5. We get the new estimates for the global mi and li.
    6. We load the part of the output for this block of Q and update it using the new running stats and the exponent trick, we then write this back into the HBM. (Current SRAM usage: 2BcD + 2BrD + 2Br)
    7. We write the updated running stats also into the HBM.
  3. For a matrix of any size, aka any context length, at a time we will never materialize the full attention matrix, only a part of it always.
  4. We managed to fuse together all the ops into a single kernel, reducing HBM access considerably.

Final SRAM usage nonetheless stands at 4BD + 2B, where B was initially calculated as M/4d, with M being the SRAM capacity. Not sure if I am missing something here. Please comment if you know why this is the case!

Block Sparse Attention and V2 and V3

I will keep this short as these versions keep the core idea but figured out better and better ways to do the same.

For Block Sparse Attention,

  1. Consider that we had masks for each block, like in the case of causal attention. If for a given block the mask is all zeros, then we can simply skip the entire block without computing anything, saving FLOPs. This is where the major gains were seen. To put this into perspective, in the case of BERT pre-training the algorithm gets a 15% boost over the best performing training setup at the time, whereas for GPT-2 we get a 3x speedup over the Hugging Face training implementation and ~2x over a Megatron setup.
Performance gain for autoregressive models, where we have a sparse mask. Source: Tri Dao et.al [2]

2. You can get the same performance in GPT-2 in a fraction of the time, literally shaving days off the training run, which is awesome!

In V2:

  1. Notice how currently we can only do parallelization at the batch and head dimension. But if you simply just flip the order to look at all the column blocks for a given row block then we get the following advantages:
    1. Each row block becomes embarrassingly parallel. You can see this by looking at the illustrations above: you need all the column blocks for a given row block to fully form the attention output. If you were to run all the column blocks in parallel, you would end up with a race condition where they try to update the same rows of the output at the same time. But not if you do it the other way around. Although there are atomic add operators in Triton which could help, they may potentially set us back.
    2. We can avoid hitting the HBM to get the global Ms and Ls. We can initialize one on the chip for each kernel.
    3. Also, we do not have to scale all the output update terms with the new estimate of L. We can just accumulate without dividing by L and, at the end of all the column blocks, simply divide the output by the latest estimate of L, saving some FLOPs again! (A short sketch of this reordering follows this list.)
  2. Much of the improvement also comes in the form of the backward kernel. I am omitting all the backward kernels from this. But they are a fun exercise to try and implement, although they are significantly more complex.
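
Continuing the pure-PyTorch sketch from earlier, here is roughly what that reordering looks like: the row-block loop moves outside, the running stats stay local to the block, and the division by L happens once at the end. Again, this is only an illustration of the idea, not the actual kernel.

import torch

def blockwise_attention_v2_style(Q, K, V, Br=64, Bc=64):
    """Row-block outer loop (FA v2 ordering): local stats, single normalization at the end."""
    N, D = Q.shape
    M = K.shape[0]
    O = torch.zeros(N, D)
    for i in range(0, N, Br):                     # each row block is independent -> parallelizable
        Qb = Q[i:i + Br]
        rows = Qb.shape[0]
        m = torch.full((rows, 1), float("-inf"))  # kept "on chip", never written back per step
        l = torch.zeros(rows, 1)
        acc = torch.zeros(rows, D)                # unnormalized output accumulator
        for j in range(0, M, Bc):
            Kb, Vb = K[j:j + Bc], V[j:j + Bc]
            S = Qb @ Kb.T
            m_new = torch.maximum(m, S.max(dim=1, keepdim=True).values)
            P = torch.exp(S - m_new)
            scale = torch.exp(m - m_new)          # rescale everything seen so far
            l = scale * l + P.sum(dim=1, keepdim=True)
            acc = scale * acc + P @ Vb
            m = m_new
        O[i:i + Br] = acc / l                     # single division at the very end
    return O

Q, K, V = (torch.randn(256, 32) for _ in range(3))
assert torch.allclose(blockwise_attention_v2_style(Q, K, V),
                      torch.softmax(Q @ K.T, dim=-1) @ V, atol=1e-4)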

Here are some benchmarks:

Performance benchmark of FA v2 against existing attention algorithms. Source: Tri Dao et.al [3]

The actual implementations of these kernels need to take into account various nuances that we encounter in the real world. I have tried to keep it simple. But do check them out here.

More recently in V3:

  1. Newer GPUs, especially the Hopper and Blackwell GPUs, have low-precision modes (FP8 on Hopper and FP4 on Blackwell), which can double or quadruple the throughput for the same power and chip area, as well as more specialized GEMM (General Matrix Multiply) kernels, which the previous versions of the algorithm fail to capitalize on. This is because there are many non-GEMM operations, like softmax, which reduce the utilization of these specialized GPU kernels.
  2. FA v1 and v2 are essentially synchronous. Recall that in the v2 description I mentioned we are limited when column blocks try to write to the same output pointers, or when we have to go step by step using the output from previous steps. Well, these modern GPUs can make use of special instructions to break this synchrony.

We overlap the comparatively low-throughput non-GEMM operations involved in softmax, such as floating point multiply-add and exponential, with the asynchronous WGMMA instructions for GEMM. As part of this, we rework the FlashAttention-2 algorithm to circumvent certain sequential dependencies between softmax and the GEMMs. For example, in the 2-stage version of our algorithm, while softmax executes on one block of the scores matrix, WGMMA executes in the asynchronous proxy to compute the next block.

Flash Attention v3, Shah et al. [5]
  3. They also adapted the algorithm to target the specialized low-precision Tensor Cores on these new devices, significantly increasing the FLOPs.

Some more benchmarks:

FA v3 Performance gain over v2. Source: Shah et. al [5]

Conclusion

There is much to admire in their work here. The barrier to entry for this kind of kernel work has often seemed high owing to the low-level details, but hopefully tools like Triton can change the game and get more people into it! The future is bright.

References

[1] Qwen 2.5-7B-Instruct-1M Huggingface Model Page

[2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

[3] Tri Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

[4] NVIDIA Hopper Architecture Page

[5] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

[6] Single-precision floating-point format, Wikipedia

The post Kernel Case Study: Flash Attention appeared first on Towards Data Science.

]]>
Agentic GraphRAG for Commercial Contracts https://towardsdatascience.com/agentic-graphrag-for-commercial-contracts/ Thu, 03 Apr 2025 04:27:24 +0000 https://towardsdatascience.com/?p=605397 Structuring legal information as a knowledge graph to increase the answer accuracy using a LangGraph agent

The post Agentic GraphRAG for Commercial Contracts appeared first on Towards Data Science.

]]>
In every business, legal contracts are foundational documents that define the relationships, obligations, and responsibilities between parties. Whether it’s a partnership agreement, an NDA, or a supplier contract, these documents often contain critical information that drives decision-making, risk management, and compliance. However, navigating and extracting insights from these contracts can be a complex and time-consuming process.

In this post, we’ll explore how we can streamline the process of understanding and working with legal contracts by implementing an end-to-end solution using agentic GraphRAG. I see GraphRAG as an umbrella term for any method that retrieves or reasons over information stored in a knowledge graph, enabling more structured and context-aware responses. 

By structuring legal contracts into a knowledge graph in Neo4j, we can create a powerful repository of information that’s easy to query and analyze. From there, we’ll build a LangGraph agent that allows users to ask specific questions about the contracts, making it possible to rapidly uncover new insights.

The code is available in this GitHub repository.

Why structuring data matters

Some domains work well with naive RAG, but legal contracts present unique challenges.

Pulling information from irrelevant contracts using naive vector RAG

As shown in the image, relying solely on a vector index to retrieve relevant chunks can introduce risks, such as pulling information from irrelevant contracts. This is because legal language is highly structured, and similar wording across different agreements can lead to incorrect or misleading retrieval. These limitations highlight the need for a more structured approach, such as GraphRAG, to ensure precise and context-aware retrieval.

To implement GraphRAG, we first need to construct a knowledge graph.

Legal knowledge graph containing both structured and unstructured information.

To build a knowledge graph for legal contracts, we need a way to extract structured information from documents and store it alongside the raw text. An LLM can help by reading through contracts and identifying key details such as parties, dates, contract types, and important clauses. Instead of treating the contract as just a block of text, we break it down into structured components that reflect its underlying legal meaning. For example, an LLM can recognize that “ACME Inc. agrees to pay $10,000 per month starting January 1, 2024” contains both a payment obligation and a start date, which we can then store in a structured format.

Once we have this structured data, we store it in a knowledge graph, where entities like companies, agreements, and clauses are represented as nodes along with their relationships. The unstructured text remains available, but now we can use the structured layer to refine our searches and make retrieval far more precise. Instead of just fetching the most relevant text chunks, we can filter contracts based on their attributes. This means we can answer questions that naive RAG would struggle with, such as how many contracts were signed last month or whether we have any active agreements with a specific company. These questions require aggregation and filtering, which isn’t possible with standard vector-based retrieval alone.

By combining structured and unstructured data, we also make retrieval more context-aware. If a user asks about a contract’s payment terms, we ensure that the search is constrained to the right agreement rather than relying on text similarity, which might pull in terms from unrelated contracts. This hybrid approach overcomes the limitations of naive RAG and allows for a much deeper and more reliable analysis of legal documents.

Graph construction

We’ll leverage an LLM to extract structured information from legal documents, using the CUAD (Contract Understanding Atticus Dataset), a widely used benchmark dataset for contract analysis licensed under CC BY 4.0. CUAD dataset contains over 500 contracts, making it an ideal dataset for evaluating our structured extraction pipeline.

The token count distribution for the contracts is visualized below.

Most contracts in this dataset are relatively short, with token counts below 10,000. However, there are some much longer contracts, with a few reaching up to 80,000 tokens. These long contracts are rare, while shorter ones make up the majority. The distribution shows a steep drop-off, meaning long contracts are the exception rather than the rule.

We’re using Gemini-2.0-Flash for extraction, which has a 1 million token input limit, so handling these contracts isn’t a problem. Even the longest contracts in our dataset (around 80,000 tokens) fit well within the model’s capacity. Since most contracts are much shorter, we don’t have to worry about truncation or breaking documents into smaller chunks for processing.

Structured data extraction

Most commercial LLMs have the option to use Pydantic objects to define the schema of the output. An example for location:

class Location(BaseModel):
    """
    Represents a physical location including address, city, state, and country.
    """

    address: Optional[str] = Field(
        ..., description="The street address of the location. Use None if not provided"
    )
    city: Optional[str] = Field(
        ..., description="The city of the location. Use None if not provided"
    )
    state: Optional[str] = Field(
        ..., description="The state or region of the location. Use None if not provided"
    )
    country: str = Field(
        ...,
        description="The country of the location. Use the two-letter ISO standard.",
    )

When using LLMs for structured output, Pydantic helps define a clear schema by specifying the types of attributes and providing descriptions that guide the model’s responses. Each field has a type, such as str or Optional[str], and a description that tells the LLM exactly how to format the output.

For example, in a Location model, we define key attributes like address, city, state, and country, specifying what data is expected and how it should be structured. The country field, for instance, follows the two-letter ISO country code standard, like "US", "FR", or "JP", instead of inconsistent variations like “United States” or “USA.” The same principle applies to other structured data as well: ISO 8601 keeps dates in a standard format (YYYY-MM-DD), and so on.

By defining structured output with Pydantic, we make LLM responses more reliable, machine-readable, and easier to integrate into databases or APIs. Clear field descriptions further help the model generate correctly formatted data, reducing the need for post-processing.

The Pydantic schema models can be more sophisticated like the Contract model below, which captures key details of a legal agreement, ensuring the extracted data follows a standardized structure.

class Contract(BaseModel):
    """
    Represents the key details of the contract.
    """
  
    summary: str = Field(
        ...,
        description=("High level summary of the contract with relevant facts and details. Include all relevant information to provide full picture."
        "Do no use any pronouns"),
    )
    contract_type: str = Field(
        ...,
        description="The type of contract being entered into.",
        enum=CONTRACT_TYPES,
    )
    parties: List[Organization] = Field(
        ...,
        description="List of parties involved in the contract, with details of each party's role.",
    )
    effective_date: str = Field(
        ...,
        description=(
            "Enter the date when the contract becomes effective in yyyy-MM-dd format."
            "If only the year (e.g., 2015) is known, use 2015-01-01 as the default date."
            "Always fill in full date"
        ),
    )
    contract_scope: str = Field(
        ...,
        description="Description of the scope of the contract, including rights, duties, and any limitations.",
    )
    duration: Optional[str] = Field(
        None,
        description=(
            "The duration of the agreement, including provisions for renewal or termination."
            "Use ISO 8601 durations standard"
        ),
    )
  
    end_date: Optional[str] = Field(
        None,
        description=(
            "The date when the contract expires. Use yyyy-MM-dd format."
            "If only the year (e.g., 2015) is known, use 2015-01-01 as the default date."
            "Always fill in full date"
        ),
    )
    total_amount: Optional[float] = Field(
        None, description="Total value of the contract."
    )
    governing_law: Optional[Location] = Field(
        None, description="The jurisdiction's laws governing the contract."
    )
    clauses: Optional[List[Clause]] = Field(
        None, description=f"""Relevant summaries of clause types. Allowed clause types are {CLAUSE_TYPES}"""
    )

This contract schema organizes key details of legal agreements in a structured way, making it easier to analyze with LLMs. It includes different types of clauses, such as confidentiality or termination, each with a short summary. The parties involved are listed with their names, locations, and roles, while contract details cover things like start and end dates, total value, and governing law. Some attributes, such as governing law, can be defined using nested models, enabling more detailed and complex outputs.

The nested object approach works well with some AI models that handle complex data relationships, while others may struggle with deeply nested details.

We can test our approach using the following example. We are using the LangChain framework to orchestrate LLMs.

from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
llm.with_structured_output(Contract).invoke(
    "Tomaz works with Neo4j since 2017 and will make a billion dollar until 2030."
    "The contract was signed in Las Vegas"
)

which outputs

Contract(
    summary="Tomaz works with Neo4j since 2017 and will make a billion dollar until 2030.",
    contract_type="Service",
    parties=[
        Organization(
            name="Tomaz",
            location=Location(
                address=None,
                city="Las Vegas",
                state=None,
                country="US"
            ),
            role="employee"
        ),
        Organization(
            name="Neo4j",
            location=Location(
                address=None,
                city=None,
                state=None,
                country="US"
            ),
            role="employer"
        )
    ],
    effective_date="2017-01-01",
    contract_scope="Tomaz will work with Neo4j",
    duration=None,
    end_date="2030-01-01",
    total_amount=1_000_000_000.0,
    governing_law=None,
    clauses=None
)

Now that our contract data is in a structured format, we can define the Cypher query needed to import it into Neo4j, mapping entities, relationships, and key clauses into a graph structure. This step transforms raw extracted data into a queryable knowledge graph, enabling efficient traversal and retrieval of contract insights.

UNWIND $data AS row
MERGE (c:Contract {file_id: row.file_id})
SET c.summary = row.summary,
    c.contract_type = row.contract_type,
    c.effective_date = date(row.effective_date),
    c.contract_scope = row.contract_scope,
    c.duration = row.duration,
    c.end_date = CASE WHEN row.end_date IS NOT NULL THEN date(row.end_date) ELSE NULL END,
    c.total_amount = row.total_amount
WITH c, row
CALL (c, row) {
    WITH c, row
    WHERE row.governing_law IS NOT NULL
    MERGE (c)-[:HAS_GOVERNING_LAW]->(l:Location)
    SET l += row.governing_law
}
FOREACH (party IN row.parties |
    MERGE (p:Party {name: party.name})
    MERGE (p)-[:HAS_LOCATION]->(pl:Location)
    SET pl += party.location
    MERGE (p)-[pr:PARTY_TO]->(c)
    SET pr.role = party.role
)
FOREACH (clause IN row.clauses |
    MERGE (c)-[:HAS_CLAUSE]->(cl:Clause {type: clause.clause_type})
    SET cl.summary = clause.summary
)

This Cypher query imports structured contract data into Neo4j by creating Contract nodes with attributes such as summary, contract_type, effective_date, duration, and total_amount. If a governing law is specified, it links the contract to a Location node. Parties involved in the contract are stored as Party nodes, with each party connected to a Location and assigned a role in relation to the contract. The query also processes clauses, creating Clause nodes and linking them to the contract while storing their type and summary.
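
As a rough sketch of how this import might be run with the official Neo4j Python driver: here, import_cypher and extracted_contracts are illustrative names for the statement above and the list of (file_id, Contract) pairs produced by the extraction step, and the connection details are assumed to live in environment variables.

import os
from neo4j import GraphDatabase

# import_cypher: the UNWIND $data ... statement shown above (illustrative variable name)
# extracted_contracts: list of (file_id, Contract) pairs from the extraction step (illustrative)
rows = [
    {"file_id": file_id, **contract.model_dump()}  # model_dump assumes Pydantic v2
    for file_id, contract in extracted_contracts
]

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)
with driver.session() as session:
    session.run(import_cypher, data=rows)
driver.close()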

After processing and importing the contracts, the resulting graph follows the schema below.

Imported legal graph schema

Let’s also take a look at a single contract.

This graph represents a contract structure where a contract (orange node) connects to various clauses (red nodes), parties (blue nodes), and locations (violet nodes). The contract has three clauses: Renewal & Termination, Liability & Indemnification, and Confidentiality & Non-Disclosure. Two parties, Modus Media International and Dragon Systems, Inc., are involved, each linked to their respective locations, Netherlands (NL) and United States (US). The contract is governed by U.S. law. The contract node also contains additional metadata, including dates and other relevant details.

A public read-only instance containing CUAD legal contracts is available with the following credentials.

URI: neo4j+s://demo.neo4jlabs.com
username: legalcontracts
password: legalcontracts
database: legalcontracts

Entity resolution

Entity resolution in legal contracts is challenging due to variations in how companies, individuals, and locations are referenced. A company might appear as “Acme Inc.” in one contract and “Acme Corporation” in another, requiring a process to determine whether they refer to the same entity.

One approach is to generate candidate matches using text embeddings or string distance metrics like Levenshtein distance. Embeddings capture semantic similarity, while string distance measures character-level differences. Once candidates are identified, additional evaluation is needed, comparing metadata such as addresses or tax IDs, analyzing shared relationships in the graph, or incorporating human review for critical cases.
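
As a small, standard-library-only illustration of the candidate-generation step (the party names below are made up or taken from the examples in this post), a normalized string-similarity pass can flag likely duplicates before any heavier evaluation:

from difflib import SequenceMatcher
from itertools import combinations

def normalize(name: str) -> str:
    # Strip punctuation and common legal suffixes before comparing
    name = name.lower().replace(",", "").replace(".", "")
    for suffix in (" incorporated", " corporation", " inc", " corp", " llc", " ltd"):
        name = name.removesuffix(suffix)
    return name.strip()

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

parties = ["Acme Inc.", "ACME Corporation", "Dragon Systems, Inc.", "Modus Media International"]

candidates = [
    (a, b, round(similarity(a, b), 2))
    for a, b in combinations(parties, 2)
    if similarity(a, b) > 0.85
]
print(candidates)  # [('Acme Inc.', 'ACME Corporation', 1.0)] -> flag for further evaluation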

For resolving entities at scale, both open-source solutions like Dedupe and commercial tools like Senzing offer automated methods. Choosing the right approach depends on data quality, accuracy requirements, and whether manual oversight is feasible.

With the legal graph constructed, we can move onto the agentic GraphRAG implementation. 

Agentic GraphRAG

Agentic architectures vary widely in complexity, modularity, and reasoning capabilities. At their core, these architectures involve an LLM acting as a central reasoning engine, often supplemented with tools, memory, and orchestration mechanisms. The key differentiator is how much autonomy the LLM has in making decisions and how interactions with external systems are structured.

One of the simplest and most effective designs, particularly for chatbot-like implementations, is a direct LLM-with-tools approach. In this setup, the LLM serves as the decision-maker, dynamically selecting which tools to invoke (if any), retrying operations when necessary, and executing multiple tools in sequence to fulfill complex requests. 

The diagram represents a simple LangGraph agent workflow. It begins at __start__, moving to the assistant node, where the LLM processes user input. From there, the assistant can either call tools to fetch relevant information or transition directly to __end__ to complete the interaction. If a tool is used, the assistant processes the response before deciding whether to call another tool or end the session. This structure allows the agent to autonomously determine when external information is needed before responding.

This approach is particularly well-suited to stronger commercial models like Gemini or GPT-4o, which excel at reasoning and self-correction.
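
A minimal sketch of this pattern, using LangGraph's prebuilt ReAct-style agent and the Gemini model used throughout this post, might look as follows (contract_search_tool refers to the contract search tool defined in the next section; the variable names are illustrative):

from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

# contract_search_tool is an instance of the ContractSearch tool defined below (illustrative name)
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
agent = create_react_agent(llm, tools=[contract_search_tool])

result = agent.invoke(
    {"messages": [("user", "Do we have any active agreements with Modus Media International?")]}
)
print(result["messages"][-1].content)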

Tools

LLMs are powerful reasoning engines, but their effectiveness often depends on how well they are equipped with external tools. These tools , whether database queries, APIs, or search functions, extend an LLM’s ability to retrieve facts, perform calculations, or interact with structured data. 

Designing tools that are both general enough to handle diverse queries and precise enough to return meaningful results is more art than science. What we’re really building is a semantic layer between the LLM and the underlying data. Rather than requiring the LLM to understand the exact structure of a Neo4j knowledge graph or a database schema, we define tools that abstract away these complexities.

With this approach, the LLM doesn’t need to know whether contract information is stored as graph nodes and relationships or as raw text in a document store. It only needs to invoke the right tool to fetch relevant data based on a user’s question.

In our case, the contract retrieval tool serves as this semantic interface. When a user asks about contract terms, obligations, or parties, the LLM calls a structured query tool that translates the request into a database query, retrieves relevant information, and presents it in a format the LLM can interpret and summarize. This enables a flexible, model-agnostic system where different LLMs can interact with contract data without needing direct knowledge of its storage or structure.

There’s no one-size-fits-all standard for designing an optimal toolset. What works well for one model may fail for another. Some models handle ambiguous tool instructions gracefully, while others struggle with complex parameters or require explicit prompting. The trade-off between generality and task-specific efficiency means tool design requires iteration, testing, and fine-tuning for the LLM in use.

For contract analysis, an effective tool should retrieve contracts and summarize key terms without requiring users to phrase queries rigidly. Achieving this flexibility depends on thoughtful prompt engineering, robust schema design, and adaptation to different LLM capabilities. As models evolve, so do strategies for making tools more intuitive and effective.

In this section, we’ll explore different approaches to tool implementation, comparing their flexibility, effectiveness, and compatibility with various LLMs.

My preferred approach is to dynamically and deterministically construct a Cypher query and execute it against the database. This method ensures consistent and predictable query generation while maintaining implementation flexibility. By structuring queries this way, we reinforce the semantic layer, allowing user inputs to be seamlessly translated into database retrievals. This keeps the LLM focused on retrieving relevant information rather than understanding the underlying data model.

Our tool is intended to identify relevant contracts, so we need to provide the LLM with options to search contracts based on various attributes. The input description is again provided as a Pydantic object.

class ContractInput(BaseModel):
    min_effective_date: Optional[str] = Field(
        None, description="Earliest contract effective date (YYYY-MM-DD)"
    )
    max_effective_date: Optional[str] = Field(
        None, description="Latest contract effective date (YYYY-MM-DD)"
    )
    min_end_date: Optional[str] = Field(
        None, description="Earliest contract end date (YYYY-MM-DD)"
    )
    max_end_date: Optional[str] = Field(
        None, description="Latest contract end date (YYYY-MM-DD)"
    )
    contract_type: Optional[str] = Field(
        None, description=f"Contract type; valid types: {CONTRACT_TYPES}"
    )
    parties: Optional[List[str]] = Field(
        None, description="List of parties involved in the contract"
    )
    summary_search: Optional[str] = Field(
        None, description="Inspect summary of the contract"
    )
    country: Optional[str] = Field(
        None, description="Country where the contract applies. Use the two-letter ISO standard."
    )
    active: Optional[bool] = Field(None, description="Whether the contract is active")
    monetary_value: Optional[MonetaryValue] = Field(
        None, description="The total amount or value of a contract"
    )

With LLM tools, attributes can take various forms depending on their purpose. Some fields are simple strings, such as contract_type and country, which store single values. Others, like parties, are lists of strings, allowing multiple entries (e.g., multiple entities involved in a contract).

Beyond basic data types, attributes can also represent complex objects. For example, monetary_value uses a MonetaryValue object, which includes structured data such as currency type and the operator. While attributes with nested objects offer a clear and structured representation of data, models tend to struggle to handle them effectively, so we should keep them simple.

As part of this project, we’re experimenting with an additional cypher_aggregation attribute, providing the LLM with greater flexibility for scenarios that require specific filtering or aggregation.

cypher_aggregation: Optional[str] = Field(
    None,
    description="""Custom Cypher statement for advanced aggregations and analytics.

    This will be appended to the base query:
    ```
    MATCH (c:Contract)
    <filtering based on other parameters>
    WITH c, summary, contract_type, contract_scope, effective_date, end_date, parties, active, monetary_value, contract_id, countries
    <your cypher goes here>
    ```
    
    Examples:
    
    1. Count contracts by type:
    ```
    RETURN contract_type, count(*) AS count ORDER BY count DESC
    ```
    
    2. Calculate average contract duration by type:
    ```
    WITH contract_type, effective_date, end_date
    WHERE effective_date IS NOT NULL AND end_date IS NOT NULL
    WITH contract_type, duration.between(effective_date, end_date).days AS duration
    RETURN contract_type, avg(duration) AS avg_duration ORDER BY avg_duration DESC
    ```
    
    3. Calculate contracts per effective date year:
    ```
    RETURN effective_date.year AS year, count(*) AS count ORDER BY year
    ```
    
    4. Counts the party with the highest number of active contracts:
    ```
    UNWIND parties AS party
    WITH party.name AS party_name, active, count(*) AS contract_count
    WHERE active = true
    RETURN party_name, contract_count
    ORDER BY contract_count DESC
    LIMIT 1
    ```
    """

The cypher_aggregation attribute allows LLMs to define custom Cypher statements for advanced aggregations and analytics. It extends the base query by appending question-specified aggregation logic, enabling flexible filtering and computation.

This feature supports use cases such as counting contracts by type, calculating average contract duration, analyzing contract distributions over time, and identifying key parties based on contract activity. By leveraging this attribute, the LLM can dynamically generate insights tailored to specific analytical needs without requiring predefined query structures.

While this flexibility is valuable, it should be carefully evaluated, as increased adaptability comes at the cost of reduced consistency and robustness due to the added complexity of the operation.

We must clearly define the function’s name and description when presenting it to the LLM. A well-structured description helps guide the model in using the function correctly, ensuring it understands its purpose, expected inputs, and outputs. This reduces ambiguity and improves the LLM’s ability to generate meaningful and reliable queries.

class ContractSearchTool(BaseTool):
    name: str = "ContractSearch"
    description: str = (
        "useful for when you need to answer questions related to any contracts"
    )
    args_schema: Type[BaseModel] = ContractInput

Finally, we need to implement a function that processes the given inputs, constructs the corresponding Cypher statement, and executes it efficiently.

The core logic of the function centers on constructing the Cypher statement. We begin by matching the contract as the foundation of the query.

cypher_statement = "MATCH (c:Contract) "

Next, we need to implement the function that processes the input parameters. In this example, we primarily use attributes to filter contracts based on the given criteria.


Simple property filtering

For example, the contract_type attribute is used to perform simple node property filtering.

if contract_type:
    filters.append("c.contract_type = $contract_type")
    params["contract_type"] = contract_type

This code adds a Cypher filter for contract_type, using query parameters for the values to prevent query injection security issues.

Since the possible contract type values are presented in the attribute description

contract_type: Optional[str] = Field(
    None, description=f"Contract type; valid types: {CONTRACT_TYPES}"
)

we don’t have to worry about mapping values from input to valid contract types as the LLM will handle that.

Inferred property filtering

We’re building tools for an LLM to interact with a knowledge graph, where the tools serve as an abstraction layer over structured queries. A key feature is the ability to use inferred properties at runtime, similar to an ontology but dynamically computed.

if active is not None:
    operator = ">=" if active else "<"
    filters.append(f"c.end_date {operator} date()")

Here, active acts as a runtime classification, determining whether a contract is ongoing (>= date()) or expired (< date()). This logic extends structured KG queries by computing properties only when needed, enabling more flexible LLM reasoning. By handling logic like this within tools, we ensure the LLM interacts with simplified, intuitive operations, keeping it focused on reasoning rather than query formulation.

Neighbor filtering

Sometimes filtering depends on neighboring nodes, such as restricting results to contracts involving specific parties. The parties attribute is an optional list, and when provided, it ensures only contracts linked to those entities are considered:

if parties:
    parties_filter = []
    for i, party in enumerate(parties):
        party_param_name = f"party_{i}"
        parties_filter.append(
            f"""EXISTS {{
            MATCH (c)<-[:PARTY_TO]-(party)
            WHERE toLower(party.name) CONTAINS ${party_param_name}
        }}"""
        )
        params[party_param_name] = party.lower()
    # Each party condition is added separately, enforcing an implicit AND between them
    filters.extend(parties_filter)

This code filters contracts based on their associated parties, treating the logic as AND, meaning all specified conditions must be met for a contract to be included. It iterates through the provided parties list and constructs a query where each party condition must hold.

For each party, a unique parameter name is generated to avoid conflicts. The EXISTS clause ensures that the contract has a PARTY_TO relationship to a party whose name contains the specified value. The name is converted to lowercase to allow case-insensitive matching. Each party condition is added separately, enforcing an implicit AND between them.

If more complex logic were needed, such as supporting OR conditions or allowing different matching criteria, the input would need to change. Instead of a simple list of party names, a structured input format specifying operators would be required.

Additionally, we could implement a party-matching method that tolerates minor typos, improving the user experience by handling variations in spelling and formatting.

Custom operator filtering

To add more flexibility, we can introduce an operator object as a nested attribute, allowing more control over filtering logic. Instead of hardcoding comparisons, we define an enumeration for operators and use it dynamically.

For example, with monetary values, a contract might need to be filtered based on whether its total amount is greater than, less than, or exactly equal to a specified value. Instead of assuming a fixed comparison logic, we define an enum that represents the possible operators:

class NumberOperator(str, Enum):
    EQUALS = "="
    GREATER_THAN = ">"
    LESS_THAN = "<"

class MonetaryValue(BaseModel):
    """The total amount or value of a contract"""
    value: float
    operator: NumberOperator

if monetary_value:
    filters.append(f"c.total_amount {monetary_value.operator.value} $total_value")
    params["total_value"] = monetary_value.value

This approach makes the system more expressive. Instead of rigid filtering rules, the tool interface allows the LLM to specify not just a value but how it should be compared, making it easier to handle a broader range of queries while keeping the LLM’s interaction simple and declarative.

Some LLMs struggle with nested objects as inputs, making it harder to handle structured operator-based filtering. Adding a between operator introduces additional complexity since it requires two separate values, which can lead to ambiguity in parsing and input validation.

Min and Max attributes

To keep things simpler, I tend to gravitate toward using min and max attributes for dates, as this naturally supports range filtering and makes the between logic straightforward.

if min_effective_date:
    filters.append("c.effective_date >= date($min_effective_date)")
    params["min_effective_date"] = min_effective_date
if max_effective_date:
    filters.append("c.effective_date <= date($max_effective_date)")
    params["max_effective_date"] = max_effective_date

This code filters contracts based on an effective date range by adding optional lower and upper bound conditions when min_effective_date and max_effective_date are provided, ensuring that only contracts within the specified range are included.

Semantic search

An attribute can also be used for semantic search. Instead of relying on a vector index upfront, we take a post-filtering approach to metadata filtering: first, structured filters like date ranges, monetary values, or parties are applied to narrow down the candidate set; then, vector search is performed over this filtered subset to rank results by semantic similarity.

if summary_search:
    cypher_statement += (
        "WITH c, vector.similarity.cosine(c.embedding, $embedding) "
        "AS score ORDER BY score DESC WITH c, score WHERE score > 0.9 "
    )  # Define a threshold limit
    params["embedding"] = embeddings.embed_query(summary_search)
else:  # Else we sort by latest
    cypher_statement += "WITH c ORDER BY c.effective_date DESC "

This code applies semantic search when summary_search is provided by computing cosine similarity between the contract’s embedding and the query embedding, ordering results by relevance, and filtering out low-scoring matches with a threshold of 0.9. Otherwise, it defaults to sorting contracts by the most recent effective_date.

Dynamic queries

The cypher aggregation attribute is an experiment I wanted to test that gives the LLM a degree of partial text2cypher capability, allowing it to dynamically generate aggregations after the initial structured filtering. Instead of predefining every possible aggregation, this approach lets the LLM specify calculations like counts, averages, or grouped summaries on demand, making queries more flexible and expressive. However, since this shifts more query logic to the LLM, ensuring all generated queries work correctly becomes challenging, as malformed or incompatible Cypher statements can break execution. This trade-off between flexibility and reliability is a key consideration in designing the system.

if cypher_aggregation:
    cypher_statement += """WITH c, c.summary AS summary, c.contract_type AS contract_type, 
      c.contract_scope AS contract_scope, c.effective_date AS effective_date, c.end_date AS end_date,
      [(c)<-[r:PARTY_TO]-(party) | {party: party.name, role: r.role}] AS parties, c.end_date >= date() AS active, c.total_amount as monetary_value, c.file_id AS contract_id,
      apoc.coll.toSet([(c)<-[:PARTY_TO]-(party)-[:LOCATED_IN]->(country) | country.name]) AS countries """
    cypher_statement += cypher_aggregation

If no cypher aggregation is provided, we return the total count of identified contracts along with only five example contracts to avoid overwhelming the prompt. Handling excessive rows is crucial, as an LLM struggling with a massive result set isn’t useful. Additionally, an LLM producing answers that list 100 contract titles isn’t a good user experience either.

cypher_statement += """WITH collect(c) AS nodes
RETURN {
    total_count_of_contracts: size(nodes),
    example_values: [
      el in nodes[..5] |
      {summary:el.summary, contract_type:el.contract_type, 
       contract_scope: el.contract_scope, file_id: el.file_id, 
        effective_date: el.effective_date, end_date: el.end_date,
        monetary_value: el.total_amount, contract_id: el.file_id, 
        parties: [(el)<-[r:PARTY_TO]-(party) | {name: party.name, role: r.role}], 
        countries: apoc.coll.toSet([(el)<-[:PARTY_TO]-()-[:LOCATED_IN]->(country) | country.name])}
    ]
} AS output"""

This cypher statement collects all matching contracts into a list, returning the total count and up to five example contracts with key attributes, including summary, type, scope, dates, monetary value, associated parties with roles, and unique country locations.

Now that our contract search tool is built, we hand it off to the LLM and, just like that, we have agentic GraphRAG implemented.
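
A rough sketch of that hand-off using LangChain-style tool binding could look like the following (the contract_search stand-in and the model choice are illustrative; the exact wiring in the project may differ):

from typing import Optional
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def contract_search(contract_type: Optional[str] = None, active: Optional[bool] = None) -> str:
    """Stand-in for the structured contract search tool built above."""
    return "search results go here"

llm = ChatOpenAI(model="gpt-4o", temperature=0)
llm_with_tools = llm.bind_tools([contract_search])

response = llm_with_tools.invoke(
    "Which active contracts have a total amount above 1,000,000?"
)
print(response.tool_calls)  # the structured arguments the LLM chose for the tool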

Agent Benchmark

If you’re serious about implementing agentic GraphRAG, you need an evaluation dataset, not just as a benchmark but as a foundation for the entire project. A well-constructed dataset helps define the scope of what the system should handle, ensuring that initial development aligns with real-world use cases. Beyond that, it becomes an invaluable tool for evaluating performance, allowing you to measure how well the LLM interacts with the graph, retrieves information, and applies reasoning. It’s also essential for prompt engineering optimizations, letting you iteratively refine queries, tool use, and response formatting with clear feedback rather than guesswork. Without a structured dataset, you’re flying blind, making improvements harder to quantify and inconsistencies more difficult to catch.

The code for the benchmark is available on GitHub.

I have compiled a list of 22 questions, which we will use to evaluate the system. Additionally, we are going to introduce a new metric called answer_satisfaction, for which we provide a custom prompt.

answer_satisfaction = AspectCritic(
    name="answer_satisfaction",
    definition="""You will evaluate an ANSWER to a legal QUESTION based on a provided SOLUTION.

Rate the answer on a scale from 0 to 1, where:
- 0 = incorrect, substantially incomplete, or misleading
- 1 = correct and sufficiently complete

Consider these evaluation criteria:
1. Factual correctness is paramount - the answer must not contradict the solution
2. The answer must address the core elements of the solution
3. Additional relevant information beyond the solution is acceptable and may enhance the answer
4. Technical legal terminology should be used appropriately if present in the solution
5. For quantitative legal analyses, accurate figures must be provided

""" + fewshots,  # few-shot examples appended to the definition (defined elsewhere)
)

Many questions can return a large amount of information. For example, asking for contracts signed before 2020 might yield hundreds of results. Since the LLM receives both the total count and a few example entries, our evaluation should focus on the total count, rather than which specific examples the LLM chooses to show.

Benchmark results.

The provided results indicate that all evaluated models (Gemini 1.5 Pro, Gemini 2.0 Flash, and GPT-4o) perform similarly well for most tool calls, with GPT-4o slightly outperforming the Gemini models (0.82 vs. 0.77). The noticeable difference emerges primarily when partial text2cypher is used, particularly for various aggregation operations.

Note that these are only 22 fairly simple questions, so we didn’t really explore the reasoning capabilities of the LLMs.

Additionally, I’ve seen projects where accuracy can be improved significantly by leveraging Python for aggregations, as LLMs typically handle Python code generation and execution better than generating complex Cypher queries directly.
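
For example, instead of asking the model to produce an aggregation in Cypher, you could hand it the filtered rows and let it generate (or call) plain Python. A minimal sketch of the idea, where rows is assumed to be the list of contract dicts returned by the search tool:

from collections import Counter

def aggregate_by_type(rows):
    """Aggregation done in Python instead of in generated Cypher."""
    counts = Counter(r["contract_type"] for r in rows)
    total_value = sum(r.get("monetary_value") or 0 for r in rows)
    return {"count_by_type": dict(counts), "total_monetary_value": total_value}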

Web Application

We’ve also built a simple React web application, powered by LangGraph hosted on FastAPI, which streams responses directly to the frontend. Special thanks to Anej Gorkic for creating the web app.

You can launch the entire stack with the following command:

docker compose up

Then navigate to localhost:5173.

Summary

As LLMs gain stronger reasoning capabilities, they, when paired with the right tools, can become powerful agents for navigating complex domains like legal contracts. In this post, we’ve only scratched the surface, focusing on core contract attributes while barely touching the rich variety of clauses found in real-world agreements. There’s significant room for growth, from expanding clause coverage to refining tool design and interaction strategies.

The code is available on GitHub.

Images

All images in this post were created by the author.

The post Agentic GraphRAG for Commercial Contracts appeared first on Towards Data Science.

Agentic AI: Single vs Multi-Agent Systems https://towardsdatascience.com/agentic-ai-single-vs-multi-agent-systems/ Wed, 02 Apr 2025 00:15:17 +0000 https://towardsdatascience.com/?p=605376 Demonstrated by building a tech news agent in LangGraph

The post Agentic AI: Single vs Multi-Agent Systems appeared first on Towards Data Science.

We’ve seen a shift over the last few years from building rigid, programmatic systems to natural language-driven workflows, made possible by more advanced large language models.

One of the interesting areas within these agentic AI systems is the difference between building a single-agent versus a multi-agent workflow, or perhaps the difference between working with more flexible versus more controlled systems.

This article will help you understand what agentic AI is, how to build simple workflows with LangGraph, and the differences in results you can achieve with the different architectures. I’ll demonstrate this by building a tech news agent with various data sources.

As for the use case, I’m a bit obsessed with getting automatic news updates, based on my preferences, without me drowning in information overload every day.

Having AI summarize for us instead of scouting info on our own | Image by author

Summarizing and gathering research is one of those areas where agentic AI can really shine.

So follow along while I keep trying to make AI do the grunt work for me, and we’ll see how single-agent compares to multi-agent setups.

I always keep my work jargon-free, so if you’re new to agentic AI, this piece should help you understand what it is and how to work with it. If you’re not new to it, you can scroll past some of the sections.

Agentic AI (& LLMs)

Agentic AI is about programming with natural language. Instead of writing rigid, explicit code, you’re instructing large language models (LLMs) to route data and perform actions through plain language to automate tasks.

Using natural language in workflows isn’t new, we’ve used NLP for years to extract and process data. What’s new is the amount of freedom we can now give language models, allowing them to handle ambiguity and make decisions dynamically.

Traditional automation from programmatic to NLP to LLMs | Image by author

But just because LLMs can understand nuanced language doesn’t mean they inherently validate facts or maintain data integrity. I see them primarily as a communication layer that sits on top of structured systems and existing data sources.

LLMs are a communication layer, not the system itself | Image by author

I usually explain it like this to non-technical people: they work a bit like we do. If we don’t have access to clean, structured data, we start making things up. Same with LLMs. They generate responses based on patterns, not truth-checking.

So just like us, they do their best with what they’ve got. If we want better output, we need to build systems that give them reliable data to work with. So, with agentic systems, we integrate ways for them to interact with different data sources, tools, and systems.

Now, just because we can use these larger models in more places, doesn’t mean we should. LLMs shine when interpreting nuanced natural language, think customer service, research, or human-in-the-loop collaboration.

But for structured tasks — like extracting numbers and sending them somewhere — you need to use traditional approaches. LLMs aren’t inherently better at math than a calculator. So, instead of having an LLM do calculations, you give an LLM access to a calculator.
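
As a small illustration of that idea, here is a minimal calculator tool in LangChain’s @tool style (the implementation details are my own, not taken from the article’s workflow):

from langchain_core.tools import tool

@tool
def add_numbers(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

# The LLM only decides *when* to call the tool; the arithmetic itself stays deterministic.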

So whenever you can build parts of a workflow programmatically, that will still be the better option.

Nevertheless, LLMs are great at adapting to messy real-world input and interpreting vague instructions so combining the two can be a great way to build systems.

Agentic Frameworks

I know a lot of people jump straight to CrewAI or AutoGen here, but I’d recommend checking out LangGraph, Agno, Mastra, and Smolagents. Based on my research, these frameworks have received some of the strongest feedback so far.

I collect resources in a Github repo here with the most popular frameworks | Image by author

LangGraph is more technical and can be complex, but it’s the preferred choice for many developers. Agno is easier to get started with but less technical. Mastra is a solid option for JavaScript developers, and Smolagents shows a lot of promise as a lightweight alternative.

In this case, I’ve gone with LangGraph — built on top of LangChain — not because it’s my favorite, but because it’s becoming a go-to framework that more devs are adopting.

So, it’s worth being familiar with.

It has a lot of abstractions, though, and you may want to rebuild some of them just to be able to control and understand the workflow better.

I will not go into detail on LangGraph here, so I decided to build a quick guide for those who need a refresher.

As for this use case, you’ll be able to run the workflow without coding anything, but if you’re here to learn you may also want to understand how it works.

Choosing an LLM

Now, you might jump into this and wonder why I’m choosing certain LLMs as the base for the agents.

You can’t just pick any model, especially when working within a framework. They need to be compatible. Key things to look for are tool calling support and the ability to generate structured outputs.

I’d recommend checking HuggingFace’s Agent Leaderboard to see which models actually perform well in real-world agentic systems.

For this workflow, you should be fine using models from Anthropic, OpenAI, or Google. If you’re considering another one, just make sure it’s compatible with LangChain.

Single vs. Multi-Agent Systems

If you build a system around one LLM and give it a bunch of tools you want it to use, you’re working with a single-agent workflow. It’s fast, and if you’re new to agentic AI, it might seem like the model should just figure things out on its own.

One agent has access to many tools | Image by author

But the thing is these workflows are just another form of system design. Like any software project, you need to plan the process, define the steps, structure the logic, and decide how each part should behave.

Think about how the logic should work for your use case | Image by author

This is where multi-agent workflows come in.

Not all of them are hierarchical or linear, though; some are collaborative. Collaborative workflows fall into the more flexible approach, which I find more difficult to work with, at least with the capabilities that exist today.

However, collaborative workflows do also break apart different functions into their own modules.

Single-agent and collaborative workflows are great to start with when you’re just playing around, but they don’t always give you the precision needed for actual tasks.

For the workflow I will build here, I already know how the APIs should be used — so it’s my job to guide the system to use them the right way.

We’ll compare a single-agent setup with a hierarchical multi-agent system, where a lead agent delegates tasks across a small team, so you can see how they behave in practice.

Building a Single Agent Workflow

With a single thread — i.e., one agent — we give an LLM access to several tools. It’s up to the agent to decide which tool to use and when, based on the user’s question.

One LLM/agent has access to many tools with many options | Image by author

The challenge with a single agent is control.

No matter how detailed the system prompt is, the model may not follow our requests (this can happen in more controlled environments too). If we give it too many tools or options, there’s a good chance it won’t use all of them or even use the right ones.

To illustrate this, we’ll build a tech news agent that has access to several API endpoints with custom data, each with several options as parameters. It’s up to the agent to decide how many of them to use and how to set up the final summary.
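
Conceptually, the single-agent setup boils down to something like the sketch below (the get_trending_keywords tool and the model name are placeholders for the workflow’s actual tools and configuration):

from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

@tool
def get_trending_keywords(category: str, period: str = "weekly") -> list[str]:
    """Placeholder for the real API call that returns trending keywords."""
    return ["AI", "Google", "Large Language Models"]

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
agent = create_react_agent(llm, tools=[get_trending_keywords])
result = agent.invoke(
    {"messages": [("user", "Give me a weekly tech update about companies.")]}
)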

Remember, I build these workflows using LangGraph. I won’t go into LangGraph in depth here, so if you want to learn the basics to be able to tweak the code, go here.

You can find the single-agent workflow here. To run it, you’ll need LangGraph Studio and the latest version of Docker installed.

Once you’re set up, open the project folder on your computer, add your GOOGLE_API_KEY in a .env file, and save. You can get a key from Google here.

Gemini Flash 2.0 has a generous free tier, so running this shouldn’t cost anything (but you may run into errors if you use it too much).

If you want to switch to another LLM or tools, you can tweak the code directly. But, again, remember the LLM needs to be compatible.

After setup, launch LangGraph Studio and select the correct folder.

This will boot up our workflow so we can test it.

Opening LangGraph Studio | Image by author

If you run into issues booting this up, double-check that you’re using the latest version of Docker.

Once it’s loaded, you can test the workflow by entering a human message and hitting submit.

LangGraph Studio opening the single agent workflow | Image by author

You can see me run the workflow below.

LangGraph Studio running the single agent workflow | Image by author

You can see the final response below.

LangGraph Studio finishing the single agent workflow | Image by author

For this prompt, it decided to check weekly trending keywords filtered by the category ‘companies’ only, then fetched the sources for those keywords and summarized them for us.

It had some issues giving us a unified summary: it simply used the information it received last and failed to use all of the research.

In reality, we want it to fetch both trending and top keywords across several categories (not just companies), check sources, track specific keywords, and reason about and summarize it all nicely before returning a response.

We can, of course, probe it and keep asking questions, but as you can imagine, if we need something more complex, it starts to take shortcuts in the workflow.

The key thing is that an agent system isn’t just going to think the way we expect; we have to actually orchestrate it to do what we want.

So a single agent is great for something simple, but as you can imagine, it may not think or behave the way we expect.

This is why going for a more complex system where each agent is responsible for one thing can be really useful.

Testing a Multi-Agent Workflow

Building multiagent workflows is a lot more difficult than building a single agent with access to some tools. To do this, you need to carefully think about the architecture beforehand and how data should flow between the agents.

The multi-agent workflow I’ll set up here uses two different teams — a research team and an editing team — with several agents under each.

Every agent has access to a specific set of tools.

The multiagent workflow logic with a hierarchical team | Image by author

We’re introducing some new tools, like a research pad that acts as a shared space — one team writes their findings, the other reads from it. The last LLM will read everything that has been researched and edited to make a summary.

An alternative to using a research pad is to store data in a scratchpad in state, isolating short-term memory for each team or agent. But that also means thinking carefully about what each agent’s memory should include.

I also decided to build out the tools a bit more to provide richer data upfront, so the agents don’t have to fetch sources for each keyword individually. Here I’m using normal programmatic logic because I can.

A key thing to remember: if you can use normal programming logic, do it.

Since we’re using multiple agents, you can lower costs by using cheaper models for most agents and reserving the more expensive ones for the important stuff.

Here, I’m using Gemini Flash 2.0 for all agents except the summarizer, which runs on OpenAI’s GPT-4o. If you want higher-quality summaries, you can use an even more advanced LLM with a larger context window.

The workflow is set up for you here. Before loading it, make sure to add both your OpenAI and Google API keys in a .env file.

In this workflow, the routes (edges) are set up dynamically instead of manually, as we did with the single agent. It’ll look more complex if you peek into the code.
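
Dynamic routing in LangGraph typically relies on conditional edges. A simplified sketch of the pattern (the node names and state fields are illustrative, not the exact ones used in the repo):

from typing import TypedDict
from langgraph.graph import StateGraph, END

class TeamState(TypedDict, total=False):
    next: str  # the supervisor writes its routing decision here

def route_from_supervisor(state: TeamState) -> str:
    return state.get("next", END)

builder = StateGraph(TeamState)
builder.add_node("supervisor", lambda state: state)      # placeholder nodes
builder.add_node("research_team", lambda state: state)
builder.add_node("editing_team", lambda state: state)
builder.add_conditional_edges(
    "supervisor",
    route_from_supervisor,
    {"research_team": "research_team", "editing_team": "editing_team", END: END},
)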

Once you boot up the workflow in LangGraph Studio — same process as before — you’ll see the graph with all these nodes ready.

Opening the multiagent workflow in LangGraph Studio | Image by author

LangGraph Studio lets us visualize how the system delegates work between agents when we run it—just like we saw in the simpler workflow above.

Since I understand the tools each agent is using, I can prompt the system in the right way. But regular users won’t know how to do this properly. So if you’re building something similar, I’d suggest introducing an agent that transforms the user’s query into something the other agents can actually work with.

We can test it out by setting a message.

“I’m an investor and I’m interested in getting an update for what has happened within the week in tech, and what people are talking about (this means categories like companies, people, websites and subjects are interesting). Please also track these specific keywords: AI, Google, Microsoft, and Large Language Models”

Then choosing “supervisor” as the Next parameter (we’d normally do this programmatically).

Running the multiagent workflow in LangGraph Studio — it will take several minutes | Image by author

This workflow will take several minutes to run, unlike the single-agent workflow we ran earlier which finished in under a minute.

So be patient while the tools are running.

In general, these systems take time to gather and process information and that’s just something we need to get used to.

The final summary will look something like this:

The result from the multiagent workflow in LangGraph Studio | Image by author

You can read the whole thing here instead if you want to check it out.

The news will obviously vary depending on when you run the workflow. I ran it the 28th of March so the example report will be for this date.

It should save the summary to a text document, but if you’re running this inside a container, you likely won’t be able to access that file easily. It’s better to send the output somewhere else — like Google Docs or via email.

As for the results, I’ll let you decide for yourself the difference between using a more complex system versus a simple one, and how it gives us more control over the process.

Finishing Notes

I’m working with a good data source here. Without that, you’d need to add a lot more error handling, which would slow everything down even more.

Clean and structured data is key. Without it, the LLM won’t perform at its best.

Even with solid data, it’s not perfect. You still need to work on the agents to make sure they do what they’re supposed to.

You’ve probably already noticed the system works — but it’s not quite there yet.

There are still several things that need improvement: parsing the user’s query into a more structured format, adding guardrails so agents always use their tools, summarizing more effectively to keep the research doc concise, improving error handling, and introducing long-term memory to better understand what the user actually needs.

State (short-term memory) is especially important if you want to optimize for performance and cost.

Right now, we’re just pushing every message into state and giving all agents access to it, which isn’t ideal. We really want to separate state between the teams. In this case, it’s something I haven’t done, but you can try it by introducing a scratchpad in the state schema to isolate what each team knows.
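
A minimal sketch of what that separation could look like in the state schema (the field names are illustrative; the current repo does not implement this):

from typing import Annotated, TypedDict
from langgraph.graph.message import add_messages

class WorkflowState(TypedDict, total=False):
    messages: Annotated[list, add_messages]   # shared conversation history
    research_scratchpad: str                  # only research-team nodes read/write this
    editing_scratchpad: str                   # only editing-team nodes read/write this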

Regardless, I hope it was a fun experience to understand the results we can get by building different Agentic Workflows.

If you want to see more of what I’m working on, you can follow me here but also on Medium, GitHub, or LinkedIn (though I’m hoping to move over to X soon). I also have a Substack, where I hope to publish shorter pieces.

❤

The post Agentic AI: Single vs Multi-Agent Systems appeared first on Towards Data Science.

Understanding the Tech Stack Behind Generative AI https://towardsdatascience.com/tech-stack-generative-ai/ Tue, 01 Apr 2025 00:35:03 +0000 https://towardsdatascience.com/?p=605364 From foundation models to vector databases and AI agents — what makes modern AI work

The post Understanding the Tech Stack Behind Generative AI appeared first on Towards Data Science.

Understanding the Tech Stack Behind Generative AI

When ChatGPT reached the one million user mark within five days and took off faster than any other technology in history, the world began to pay attention to artificial intelligence and AI applications.

And so it continued apace. Since then, many different terms have been buzzing around — from ChatGPT and Nvidia H100 chips to Ollama, LangChain, and Explainable AI. What do all of these terms actually mean?

That’s exactly what you’ll find in this article: A structured overview of the technology ecosystem around generative AI and LLMs.

Let’s dive in!

Table of Contents
1 What makes generative AI work – at its core
2 Scaling AI: Infrastructure and Compute Power
3 The Social Layer of AI: Explainability, Fairness and Governance
4 Emerging Abilities: When AI Starts to Interact and Act
Final Thoughts

Where Can You Continue Learning?

1 What makes generative AI work – at its core

New terms and tools in the field of artificial intelligence seem to emerge almost daily. At the core of it all are the foundational models, frameworks and the infrastructure required to run generative AI in the first place.

Foundation Models

Do you know the Swiss Army Knife? Foundation models are like such a multifunctional knife – you can perform many different tasks with just one tool.

Foundation models are large AI models that have been pre-trained on huge amounts of data (text, code, images, etc.). What is special about these models is that they can not only solve a single task but can also be used flexibly for many different applications. They can write texts, correct code, generate images or even compose music. And they are the basis for many generative AI applications.

The following three aspects are key to understanding foundation models:

  • Pre-trained
    These models were trained on huge data sets. This means that the model has ‘read’ a huge amount of text or other data. This phase is very costly and time-consuming.
  • Multitask-capable
    These foundation models can solve many tasks. If we look at GPT-4o, for example, you can use it for everyday knowledge questions, text improvement, and code generation.
  • Transferable
    Through fine-tuning or Retrieval Augmented Generation (RAG), we can adapt such Foundation Models to specific domains or specialise them for specific application areas. I have written about RAG and fine-tuning in detail in How to Make Your LLM More Accurate with RAG & Fine-Tuning. But the core of it is that you have two options to make your LLM more accurate: With RAG, the model remains the same, but you improve the input by providing the model with additional sources. For example, the model can access past support tickets or legal texts during a query – but the model parameters and weightings remain unchanged. With fine-tuning, you retrain the pre-trained model with additional sources – the model saves this knowledge permanently.
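
To make the RAG option from the list above concrete, here is a minimal sketch of the pattern – retrieved context is pasted into the prompt, while the model weights stay untouched (the retrieve function is a placeholder for a real vector-store lookup):

def retrieve(query: str) -> list[str]:
    """Placeholder: look up the most relevant snippets in a vector store."""
    return ["Support ticket #123: ...", "Contract clause 4.2: ..."]

def build_rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )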

To get a feel for the amount of data we are talking about, let’s look at FineWeb. FineWeb is a massive dataset developed by Hugging Face to support the pre-training phase of LLMs. The dataset was created from 96 common-crawl snapshots and comprises 15 trillion tokens – which takes up about 44 terabytes of storage space.

Most foundation models are based on the Transformer architecture. In this article, I won’t go into this in more detail as it’s about the high-level components around AI. The most important thing to understand is that these models can look at the entire context of a sentence at the same time, for example – and not just read word by word from left to right. The foundational paper introducing this architecture was Attention is All You Need (2017).

All major players in the AI field have released foundation models — each with different strengths, use cases, and licensing conditions (open-source or closed-source).

GPT-4 from OpenAI, Claude from Anthropic and Gemini from Google, for example, are powerful but closed models. This means that neither the model weights nor the training data are accessible to the public.

There are also high-performing open-source models from Meta, such as LLaMA 2 and LLaMA 3, as well as from Mistral and DeepSeek.

A great resource for comparing these models is the LLM Arena on Hugging Face. It provides an overview of various language models, ranks them and allows for direct comparisons of their performance.

Screenshot taken by the author: We can see a comparison of different llm models in the LLM Arena.

Multimodal models

If we look at the GPT-3 model, it can only process pure text. Multimodal models now go one step further: They can process and generate not only text, but also images, audio and video. In other words, they can process and generate several types of data at the same time.

What does this mean in concrete terms?

Multimodal models process different types of input (e.g. an image and a question about it) and combine this information to provide more intelligent answers. For example, with Gemini 1.5 you can upload a photo of different ingredients and ask which ingredients you see on the plate.

How does this work technically?

Multimodal models understand not only language but also visual or auditory information. Multimodal models are also usually based on the transformer architecture, like pure text models. However, an important difference is that not only words are processed as ‘tokens’ but also images, as so-called patches. These are small image sections that are converted into vectors and can then be processed by the model.

Let’s have a look at some examples:

  • GPT-4-Vision
    This model from OpenAI can process text and images. It recognises content on images and combines it with speech.
  • Gemini 1.5
    Google’s model can process text, images, audio and video. It is particularly strong at retaining context across modalities.
  • Claude 3
    Anthropic’s model can process text and images and is very good at visual reasoning. It is good at recognising diagrams, graphics and handwriting.

Other examples are Flamingo from DeepMind, Kosmos-2 from Microsoft or Grok (xAI) from Elon Musk’s xAI, which is integrated into Twitter.

GPU & Compute Providers

When generative AI models are trained, this requires enormous computing capacity. Especially for pre-training but also for inference – the subsequent application of the model to new inputs.

Imagine a musician practising for months to prepare for a concert – that’s what pre-training is like. During pre-training, a model such as GPT-4, Claude 3, LLaMA 3 or DeepSeek-VL learns from trillions of tokens that come from texts, code, images and other sources. These data volumes are processed with GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units). This is necessary because this hardware enables parallel computing (compared to CPUs). Many companies rent computing power in the cloud (e.g. via AWS, Google Cloud, Azure) instead of operating their own servers.

When a pre-trained model is adapted to specific tasks with fine-tuning, this, in turn, requires a lot of computing power. This is one of the major differences compared to customising the model with RAG. One way to make fine-tuning more resource-efficient is low-rank adaptation (LoRA). Here, small parts of the model are specifically retrained instead of retraining the entire model with new data.
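
With the Hugging Face peft library, a LoRA setup looks roughly like this (the model name, target modules, and rank shown are typical illustrative values, not a recommendation):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # only these projections get trainable adapters
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters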

If we stay with the music example, inference is the moment the actual live concert takes place – and it has to be played over and over again. This also makes it clear that inference requires resources of its own. Inference is the process of applying an AI model to a new input (e.g. you ask ChatGPT a question) to generate an answer or a prediction.

Some examples:

Specialised hardware components that are optimised for parallel computing are used for this. For example, NVIDIA’s A100 and H100 GPUs are standard in many data centres. AMD Instinct MI300X, for example, are also catching up as a high-performance alternative. Google TPUs are also used for certain workloads – especially in the Google ecosystem.

ML Frameworks & Libraries

Just like in programming languages or web development, there are frameworks for AI tasks. For example, they provide ready-made functions for building neural networks without the need to program everything from scratch. Or they make training more efficient by parallelising calculations with the framework and making efficient use of GPUs.

The most important ML frameworks for generative AI:

  • PyTorch was developed by Meta and is open source. It is very flexible and popular in research & open source.
  • TensorFlow was developed by Google and is very powerful for large AI models. It supports distributed training and is often used in cloud environments.
  • Keras is a part of TensorFlow and is mainly used for beginners and prototype development.
  • JAX is also from Google and was specially developed for high-performance AI calculations. It is often used for advanced research and Google DeepMind projects. For example, it is used for the latest Google AI models such as Gemini and Flamingo.

PyTorch and TensorFlow can easily be combined with other tools such as Hugging Face Transformers or ONNX Runtime.

AI Application Frameworks

These frameworks enable us to integrate the Foundation Models into specific applications. They simplify access to the Foundation Models, the management of prompts and the efficient administration of AI-supported workflows.

Three tools, as examples:

  1. LangChain enables the orchestration of LLMs for applications such as chatbots, document processing and automated analyses. It supports access to APIs, databases and external storage. And it can be connected to vector databases – which I explain in the next section – to perform contextual queries.

    Let’s look at an example: A company wants to build an internal AI assistant that searches through documents. With LangChain, it can now connect GPT-4 to the internal database and the user can search company documents using natural language.
  2. LlamaIndex was specifically designed to make large amounts of unstructured data efficiently accessible to LLMs and is therefore important for Retrieval Augmented Generation (RAG). Since LLMs only have a limited knowledge base from their training data, RAG retrieves additional information before generating an answer. And this is where LlamaIndex comes into play: it can be used to convert unstructured data, e.g. from PDFs, websites or databases, into searchable indices.

    Let’s take a look at a concrete example:

    A lawyer needs a legal AI assistant to search laws. LlamaIndex organises thousands of legal texts and can therefore provide precise answers quickly.
  3. Ollama makes it possible to run large language models on your own laptop or server without having to rely on the cloud. No API access is required as the models run directly on the device.

    For example, you can run a model such as Mistral, LLaMA 3 or DeepSeek locally on your device.

Databases & Vector Stores

In traditional data processing, relational databases (SQL databases) store structured data in tables, while NoSQL databases such as MongoDB or Cassandra are used to store unstructured or semi-structured data.

With LLMs, however, we now also need a way to store and search semantic information.

This requires vector databases: a foundation model does not process input as text but converts it into numerical vectors – so-called embeddings. Vector databases make it possible to perform fast similarity search and memory management for embeddings and thus provide relevant contextual information.

How does this work, for example, with Retrieval Augmented Generation?

  1. Each text (e.g. a paragraph from a PDF) is translated into a vector.
  2. You pass a query to the model as a prompt. For example, you ask a question. This question is now also translated into a vector.
  3. The database now calculates which vectors are closest to the input vector.
  4. These top results are made available to the LLM before it answers. And the model then uses this information additionally for the answer.

Examples of this are Pinecone, FAISS, Weaviate, Milvus, and Qdrant.
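
The retrieval step behind points 1–4 can be illustrated with a few lines of plain numpy – embed, compare by cosine similarity, take the top results (the embed function here is just a stand-in for a real embedding model):

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g. a sentence-transformers call)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

documents = ["Paragraph about contract law.", "Paragraph about GPU pricing."]
doc_vectors = np.stack([embed(d) for d in documents])

query_vec = embed("What does the contract say?")
scores = doc_vectors @ query_vec / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec)
)
top_docs = [documents[i] for i in np.argsort(scores)[::-1][:1]]  # closest document(s)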

Programming Languages

Generative AI development also needs a programming language.

Of course, Python is probably the first choice for almost all AI applications. Python has established itself as the main language for AI & ML and is one of the most popular and widely used languages. It is flexible and offers a large AI ecosystem with all the previously mentioned frameworks such as TensorFlow, PyTorch, LangChain or LlamaIndex.

Why isn’t Python used for everything?

Python is not very fast. But thanks to CUDA backends, TensorFlow or PyTorch are still very performant. However, if performance is really very important, Rust, C++ or Go are more likely to be used.

Another language that must be mentioned is Rust: This language is used when it comes to fast, secure and memory-efficient AI infrastructures. For example, for efficient databases for vector searches or high-performance network communication. It is primarily used in the infrastructure and deployment area.

Julia is a language that is close to Python, but much faster – this makes it perfect for numerical calculations and tensor operations.

TypeScript or JavaScript are not directly relevant for AI applications but are often used in the front end of LLM applications (e.g., React or Next.js).

Own visualization — Illustrations from unDraw.co

2 Scaling AI: Infrastructure and Compute Power

Apart from the core components, we also need ways to scale and train the models.

Containers & Orchestration

Not only traditional applications, but also AI applications need to be provided and scaled. I wrote about containerisation in detail in this article Why Data Scientists Should Care about Containers – and Stand Out with This Knowledge. But at its core, the point is that with containers, we can run an AI model (or any other application) on any server and it works the same. This allows us to provide consistent, portable and scalable AI workloads.

Docker is the standard for containerisation. Generative AI is no different. We can use it to develop AI applications as isolated, repeatable units. Docker is used to deploy LLMs in the cloud or on edge devices. Edge means that the AI does not run in the cloud, but locally on your device. The Docker images contain everything you need: Python, ML frameworks such as PyTorch, CUDA for GPUs and AI APIs.

Let’s take a look at an example: A developer trains a model locally with PyTorch and saves it as a Docker container. This allows it to be easily deployed to AWS or Google Cloud.

Kubernetes is there to manage and scale container workloads. It can manage GPUs as resources. This makes it possible to run multiple models efficiently on a cluster – and to scale automatically when demand is high.

Kubeflow is less well-known outside of the AI world. It allows ML models to be orchestrated as a workflow from data processing to deployment. It is specifically designed for machine learning in production environments and supports automatic model training & hyperparameter tuning.

Chip manufacturers & AI hardware

The immense computing power that is required must be produced. This is done by chip manufacturers. Powerful hardware reduces training times and improves model inference.

There are now also some models that have been trained with fewer parameters or fewer resources for the same performance. When DeepSeek was published at the end of February, it was somewhat questioned how many resources are actually necessary. It is becoming increasingly clear that huge models and extremely expensive hardware are not always necessary.

Probably the best-known chip manufacturer in the field of AI is Nvidia, one of the most valuable companies. With its specialised A100 and H100 GPUs, the company has become the de facto standard for training and inferencing large AI models. In addition to Nvidia, however, there are other important players such as AMD with its Instinct MI300X series, Google, Amazon and Cerebras.

API Providers for Foundation Models

The Foundation Models are pre-trained models. We use APIs so that we can access them as quickly as possible without having to host them ourselves. API providers offer quick access to the models, such as OpenAI API, Hugging Face Inference Endpoints or Google Gemini API. To do this, you send a text via an API and receive the response back. However, APIs such as the OpenAI API are subject to a fee.

The best-known provider is OpenAI, whose API provides access to GPT-3.5, GPT-4, DALL-E for image generation and Whisper for speech-to-text. Anthropic also offers a powerful alternative with Claude 2 and 3. Google provides access to multimodal models such as Gemini 1.5 via the Gemini API.

Hugging Face is a central hub for open source models: the inference endpoints allow us to directly address Mistral 7B, Mixtral or Meta models, for example.

Another exciting provider is Cohere, which provides Command R+, a model specifically for Retrieval Augmented Generation (RAG) – including powerful embedding APIs.

Serverless AI architectures

Serverless computing does not mean that there is no server but that you do not need your own server. You only define what is to be executed – not how or where. The cloud environment then automatically starts an instance, executes the code and shuts the instance down again. The AWS Lambda functions, for example, are well-known here.

Something similar is also available specifically for AI. Serverless AI reduces the administrative effort and scales automatically. This is ideal, for example, for AI tasks that are used irregularly.

Let’s take a look at an example: A chatbot on a website that answers questions from customers doesn’t have to run all the time. However, when a visitor comes to the website and asks a question, it must have resources. It is, therefore, only called up when needed.

Serverless AI can save costs and reduce complexity. However, it is not useful for continuous, latency-critical tasks.

Examples: AWS Bedrock, Azure OpenAI Service, Google Cloud Vertex AI

3 The Social Layer of AI: Explainability, Fairness and Governance

With great power and capability comes responsibility. The more we integrate AI into our everyday applications, the more important it becomes to engage with the principles of Responsible AI.

So…Generative AI raises many questions:

  • Does the model explain how it arrives at its answers?
    -> Question about Transparency
  • Are certain groups favoured?
    -> Question about Fairness
  • How is it ensured that the model is not misused?
    -> Question about Security
  • Who is liable for errors?
    -> Question about Accountability
  • Who controls how and where AI is used?
    -> Question about Governance
  • Which available data from the web (e.g. images from artists) may be used?
    -> Question about Copyright / data ethics

While we have comprehensive regulations for many areas of the physical world — such as noise control, light pollution, vehicles, buildings, and alcohol sales — similar regulatory efforts in the IT sector are still rare and often avoided.

I’m not making a generalisation or a value judgment about whether this is good or bad. Less regulation can accelerate innovation – new technologies reach the market faster. At the same time, there is a risk that important aspects such as ethical responsibility, bias detection or energy consumption by large models will receive too little attention.

With the AI Act, the EU is focusing more on a regulated approach that is intended to create clear framework conditions – but this, in turn, can reduce the speed of innovation. The USA tends to pursue a market-driven, liberal approach with voluntary guidelines. This promotes rapid development but often leaves ethical and social issues in the background.

Let’s take a look at three concepts:

Explainability

Many large LLMs such as GPT-4 or Claude 3 are considered so-called black boxes: they provide impressive answers, but we do not know exactly how they arrive at these results. The more we entrust them with – especially in sensitive areas such as education, medicine or justice – the more important it becomes to understand their decision-making processes.

Tools such as LIME, SHAP or Attention Maps are ways of minimising these problems. They analyse model decisions and present them visually. In addition, model cards (standardised documentation) help to make the capabilities, training data, limitations and potential risks of a model transparent.

Fairness

If a model has been trained with data that contains biases or biased representations, it will also inherit these biases and distortions. This can lead to certain population groups being systematically disadvantaged or stereotyped. There are methods for recognising bias and clear standards for how training data should be selected and tested.

Governance

Finally, the question of governance arises: Who actually determines how AI may be used? Who checks whether a model is being operated responsibly?

4 Emerging Abilities: When AI Starts to Interact and Act

This is about the new capabilities that go beyond the classic prompt-response model. AI is becoming more active, more dynamic and more autonomous.

Let’s take a look at a concrete example:

A classic LLM like GPT-3 follows the typical process: you ask a question like ‘Please show me how to create a button with rounded corners using HTML & CSS’. The model then provides you with the appropriate code, including a brief explanation. It returns a pure text output, without actively executing or reasoning about anything further.

Screenshot taken by the author: The answer from ChatGPT if we ask for creating buttons with rounded corners.

AI agents go much further. They not only analyse the prompt but also develop plans independently, access external tools or APIs and can complete tasks in several steps.

A simple example:

Instead of just writing the template for an email, an agent can monitor a data source and independently send an email as soon as a certain event occurs. For example, an email could go out when a sales target has been exceeded.

AI agents

AI agents are an application logic based on the Foundation Models. They orchestrate decisions and execute steps independently. Agents such as AutoGPT carry out multi-step tasks independently. They think in loops and try to improve or achieve a goal step by step.

Some examples:

  • Your AI agent analyzes new market reports daily, summarizes them, stores them in a database, and notifies the user in case of deviations.
  • An agent initiates a job application process: It scans submitted profiles and matches them with job offers.
  • In an e-commerce shop, the agent monitors inventory levels and customer demand. If a product is running low, it automatically reorders it – including price comparisons between suppliers.

What typically makes up an AI agent?

An AI agent consists of several specialized components, making it possible to autonomously plan, execute, and learn tasks:

  • Large Language Model
    The LLM is the core or thinking engine. Typical models include GPT-4, Claude 3, Gemini 1.5, or Mistral 7B.
  • Planning unit
    The planner transforms a higher-level goal into a concrete plan or sequence of steps. Often based on methods like Chain-of-Thought or ReAct.
  • Tool access
    This component enables the agent to use external tools. For example, using a browser for extended search, a Python environment for code execution or enabling access to APIs and databases.
  • Memory
    This component stores information about previous interactions, intermediate results, or contextual knowledge. This is necessary so that the agent can act consistently across multiple steps.
  • Executor
    This component executes the planned steps in the correct order, monitors progress, and replans in case of errors.

There are also tools like Make or n8n (low-code / no-code automation platforms), which also let you implement “agent-like” logic. They execute workflows with conditions, triggers, and actions. For example, an automated reply should be formulated when a new email arrives in the inbox. And there are a lot of templates for such use cases.

Screenshot taken by the author: Templates on n8n as an example for low-code or no-code platforms.

Reinforcement Learning

With reinforcement learning, the models are made more “human-friendly.” In this training method, the model learns through reward. This is especially important for tasks where there is no clear “right” or “wrong,” but rather gradual quality.

An example of this is when you use ChatGPT, receive two different responses and are asked to rate which one you prefer.

The reward can come either from human feedback (Reinforcement Learning from Human Feedback – RLHF) or from another model (Reinforcement Learning from AI Feedback – RLAIF). In RLHF, a human rates several responses from a model, allowing the LLM to learn what “good” responses look like and better align with human expectations. In RLAIF, the model doesn’t just receive binary feedback (e.g., good vs. bad) but differentiated, context-dependent rewards (e.g., a variable reward scale from -1 to +3). RLAIF is especially useful where there are many possible “good” responses, but some match the user’s intent much better.

On my Substack, I regularly write summaries about the published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you’re interested, take a look or subscribe.

Final Thoughts

It would probably be possible to write an entire book about generative AI right now – not just a single article. Artificial intelligence has been researched and applied for many years. But we are currently in a moment where an explosion of tools, applications, and frameworks is happening – AI, and especially generative AI, has truly arrived in our everyday lives. Let’s see where this takes us and end with a quote from Alan Kay:

The best way to predict the future is to invent it.

Where Can You Continue Learning?

The post Understanding the Tech Stack Behind Generative AI appeared first on Towards Data Science.
