Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech
https://towardsdatascience.com/sesame-speech-model-how-this-viral-ai-model-generates-human-like-speech/
A deep dive into residual vector quantizers, conversational speech AI, and talkative transformers.

Recently, Sesame AI published a demo of their latest speech-to-speech model: a conversational AI agent that is really good at speaking. It provides relevant answers, speaks with expression, and honestly, it is just very fun and interactive to play with.

Note that a technical paper is not out yet, but they do have a short blog post that provides a lot of information about the techniques they used and previous algorithms they built upon. 

Thankfully, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or a CSM. It inputs both text and audio, and generates speech as audio. While they haven’t revealed their training data sources in the articles, we can still make a solid guess. The blog post heavily cites another CSM, 2024’s Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream training), and 2000 more hours of telephone conversations (the Fisher dataset).


Sesame builds upon the Moshi Paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values — a waveform. For example, if you’re sampling audio at 24 kHz, you are capturing 24,000 float values every second.

There are 24000 values here to represent 1 second of speech! (Image generated by author)

Of course, it is quite resource-intensive to process 24000 float values for just one second of data, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.
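
To make this concrete, here is a tiny NumPy sketch, with a sine tone standing in for real speech, showing that one second of 24 kHz audio really is just 24,000 float values:

import numpy as np

SAMPLE_RATE = 24_000          # 24 kHz sampling, as used by Mimi
DURATION_S = 1.0

# A 440 Hz sine tone stands in for a real speech recording
t = np.linspace(0.0, DURATION_S, int(SAMPLE_RATE * DURATION_S), endpoint=False)
waveform = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)

print(waveform.shape)         # (24000,) -> 24,000 floats for a single second of audio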

We will take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of Audio/Speech modeling in Deep Learning today. We will end the article by learning about how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper as well. Mimi is a self-supervised audio encoder-decoder model that converts audio waveforms into discrete “latent” tokens first, and then reconstructs the original signal. Sesame only uses the encoder section of Mimi to tokenize the input audio. Let’s learn how.

Mimi takes the raw speech waveform at 24 kHz and passes it through several strided convolution layers to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This means that the first CNN block downsamples the audio by 4x, the next by 5x, then 6x, and so on. In total, it downsamples by a factor of 1920 (4 × 5 × 6 × 8 × 2), reducing the signal to just 12.5 frames per second.

The convolution blocks also project the original float values to an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. One second of audio is now represented as roughly 12 vectors of size 512. This way, Mimi reduces the sequence length from 24,000 samples to about 12 frames and converts them into dense continuous vectors.

Before applying any quantization, the Mimi encoder downsamples the input 24 kHz audio by a factor of 1920 and embeds it into 512 dimensions. In other words, you get 12.5 frames per second, with each frame being a 512-dimensional vector. (Image from author’s video)
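
As a rough illustration of this front end (not the real Mimi architecture: the kernel sizes, paddings, and channel progression below are assumptions), a stack of strided 1D convolutions with strides 4, 5, 6, 8, and 2 yields exactly this kind of 1920x reduction:

import torch
import torch.nn as nn

strides = [4, 5, 6, 8, 2]                  # 4*5*6*8*2 = 1920x total downsampling
channels = [1, 64, 128, 256, 512, 512]     # assumed channel progression, ending at 512

layers = []
for i, s in enumerate(strides):
    layers += [
        nn.Conv1d(channels[i], channels[i + 1], kernel_size=2 * s, stride=s, padding=s // 2),
        nn.ELU(),
    ]
encoder = nn.Sequential(*layers)

audio = torch.randn(1, 1, 24_000)          # (batch, channels, samples) = one second at 24 kHz
frames = encoder(audio)
print(frames.shape)                         # torch.Size([1, 512, 12]): ~12.5 frames/sec, each 512-dim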

What is Audio Quantization?

Given the continuous embeddings obtained after the convolution layer, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language learning transformers to train generative models.

Mimi uses a Residual Vector Quantizer or RVQ tokenizer to achieve this. We will talk about the residual part soon, but first, let’s look at what a simple vanilla Vector quantizer does.

Vector Quantization

The idea behind vector quantization is simple: you train a codebook, which is a collection of, say, 1000 code vectors, each of size 512 (the same as your embedding dimension).

A Vanilla Vector Quantizer. A codebook of embeddings is trained. Given an input embedding, we map/quantize it to the nearest codebook entry. (Screenshot from author’s video)

Then, given the input vector, we will map it to the closest vector in our codebook — basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame embedding may be, we will represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic where I go much deeper with this.
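
Here is a minimal PyTorch sketch of that snapping step. The codebook is random rather than trained, purely to illustrate the nearest-neighbor lookup:

import torch

codebook = torch.randn(1000, 512)           # 1000 code vectors, same dim as the frame embeddings
frame = torch.randn(512)                    # one continuous frame embedding from the encoder

# Snap to the nearest codebook entry (Euclidean distance)
distances = torch.cdist(frame.unsqueeze(0), codebook)    # shape (1, 1000)
token_id = distances.argmin(dim=-1).item()                # discrete token for this frame
quantized = codebook[token_id]                            # the centroid used downstream

print(token_id, quantized.shape)            # e.g. 417, torch.Size([512])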

More about Vector Quantization! (Video by author)

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high because we are mapping each vector to its cluster’s centroid. This “snap” is rarely perfect, so there is always an error between the original embedding and the nearest codebook entry.

The big idea of Residual Vector Quantization is that it doesn’t stop at having just one codebook. Instead, it tries to use multiple codebooks to represent the input vector.

  1. First, you quantize the original vector using the first codebook.
  2. Then, you subtract that centroid from your original vector. What you’re left with is the residual — the error that wasn’t captured in the first quantization.
  3. Now take this residual, and quantize it again, using a second codebook full of brand new code vectors — again by snapping it to the nearest centroid.
  4. Subtract that too, and you get a smaller residual. Quantize again with a third codebook… and you can keep doing this for as many codebooks as you want.
Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer to represent the previous codebook’s error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, let’s say, N codebooks, you get a collection of N discrete tokens from each stage of quantization to represent one audio frame.
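
The encoding loop described above can be sketched in a few lines. The codebooks here are random and untrained, and N = 8 is just an assumed number of quantizer stages:

import torch

def rvq_encode(frame, codebooks):
    """Return one token per codebook; each stage quantizes the previous stage's residual."""
    tokens, residual = [], frame
    for cb in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), cb)    # distance to every code in this codebook
        idx = dists.argmin(dim=-1).item()
        tokens.append(idx)
        residual = residual - cb[idx]                     # what this stage failed to capture
    return tokens

N = 8                                                     # assumed number of codebooks
codebooks = [torch.randn(1000, 512) for _ in range(N)]
frame = torch.randn(512)

tokens = rvq_encode(frame, codebooks)
print(tokens)          # N discrete tokens representing one audio frame

On the decoder side, the approximate reconstruction of the original embedding is simply the sum of the selected code vectors from all N codebooks.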

The coolest thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. In the subsequent quantizers, they learn more and more fine-grained features.

If you’re familiar with PCA, you can think of the first codebook as containing the primary principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds more details.

Residual Vector Quantizers (RVQ) uses multiple codebooks to encode the input vector — one entry from each codebook. (Screenshot from author’s video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal to the discretized latent space, and the decoder reconstructs it back from the latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space. 

Mimi also separately trains a single codebook (vanilla VQ) that only focuses on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer – it divides the quantization process into two independent parallel paths: one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi used knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic RVQ code and the WavLM-generated embedding.
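
Conceptually, that distillation term looks something like the sketch below: a cosine-distance loss pulling the semantic-codebook output toward a frozen WavLM embedding. The tensor shapes and any projection to a shared dimension are assumptions, not the exact Mimi recipe:

import torch
import torch.nn.functional as F

def semantic_distillation_loss(semantic_embedding, wavlm_embedding):
    """Cosine-distance loss pulling Mimi's semantic codes toward the WavLM teacher."""
    cos_sim = F.cosine_similarity(semantic_embedding, wavlm_embedding, dim=-1)
    return (1.0 - cos_sim).mean()

# Toy tensors standing in for one batch of frames (batch, frames, dim are assumed)
semantic = torch.randn(4, 125, 512, requires_grad=True)   # Mimi semantic-VQ output
teacher = torch.randn(4, 125, 512)                         # WavLM embeddings (frozen)
loss = semantic_distillation_loss(semantic, teacher)
print(loss.item())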


Audio Decoder

Given a conversation containing text and audio, we first convert them into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then input into a transformer model as a time series. In the blog post, this model is referred to as the Autoregressive Backbone Transformer. Its task is to process this time series and output the “zeroth” codebook token.

A lighter-weight transformer, called the audio decoder, then generates the remaining codebook tokens conditioned on this zeroth code produced by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder operates on the zeroth token and generates the other N-1 codes. These codes are generated using N-1 distinct linear layers that output the probability of choosing each code from their corresponding codebooks.

You can imagine this process as predicting a text token from the vocabulary in a text-only LLM. The difference is that a text-based LLM has a single vocabulary, whereas the RVQ tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each.

The Sesame Architecture (Illustration by the author)
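
A hedged sketch of those per-codebook output heads is shown below. The codebook count, codebook size, and hidden dimension are placeholders, not Sesame’s actual numbers:

import torch
import torch.nn as nn

N_CODEBOOKS = 32          # assumed number of RVQ codebooks
CODEBOOK_SIZE = 1024      # assumed vocabulary size per codebook
D_MODEL = 1024            # assumed hidden size of the audio decoder

# One classification head per codebook 1..N-1 (codebook 0 comes from the backbone)
heads = nn.ModuleList(
    nn.Linear(D_MODEL, CODEBOOK_SIZE) for _ in range(N_CODEBOOKS - 1)
)

decoder_hidden = torch.randn(1, D_MODEL)          # audio-decoder state for the current frame
codes = []
for head in heads:
    logits = head(decoder_hidden)                 # logits over this codebook's entries
    codes.append(logits.argmax(dim=-1).item())    # greedy pick, just for illustration

print(len(codes))   # N-1 codewords to pair with the zeroth code from the backbone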

Finally, after the codewords are all generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this audio back to a waveform. For this, we apply transposed convolutional layers to upsample the embeddings from 12.5 frames per second back to a 24 kHz waveform. Basically, this reverses the transforms we applied during audio preprocessing.
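
A minimal sketch of that upsampling step, mirroring the encoder strides in reverse with transposed convolutions (kernel sizes and channel widths are again assumptions):

import torch
import torch.nn as nn

strides = [2, 8, 6, 5, 4]     # encoder strides applied in reverse: 2*8*6*5*4 = 1920x upsampling
layers = []
for s in strides:
    layers += [
        nn.ConvTranspose1d(512, 512, kernel_size=2 * s, stride=s, padding=s // 2),
        nn.ELU(),
    ]
layers.append(nn.Conv1d(512, 1, kernel_size=7, padding=3))   # back to a single waveform channel
decoder = nn.Sequential(*layers)

frames = torch.randn(1, 512, 12)        # about one second of combined audio embeddings
waveform = decoder(frames)
print(waveform.shape)                    # about (1, 1, 23000): roughly one second of 24 kHz audio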

In Summary

Check out the accompanying video on this article! (Video by author)

So, here is the overall summary of the Sesame model in some bullet points.

  1. Sesame is built as a multimodal Conversational Speech Model (CSM).
  2. Text and audio are tokenized together to form a sequence of tokens and input into the backbone transformer, which autoregressively processes the sequence.
  3. While the text is processed as in any other text-based LLM, the audio is processed directly from its waveform representation. They use the Mimi encoder to convert the waveform into latent codes using a split-RVQ tokenizer.
  4. The multimodal backbone transformer consumes a sequence of tokens and predicts the next zeroth codeword.
  5. Another lightweight transformer called the audio decoder predicts the remaining codewords from the zeroth codeword.
  6. The final audio frame representation is generated by combining all the generated codewords and is then upsampled back to the waveform representation.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Relevant papers: 
Moshi: https://arxiv.org/abs/2410.00037 
SoundStream: https://arxiv.org/abs/2107.03312 
HuBERT: https://arxiv.org/abs/2106.07447 
SpeechTokenizer: https://arxiv.org/abs/2308.16692


Learnings from a Machine Learning Engineer — Part 6: The Human Side
https://towardsdatascience.com/learnings-from-a-machine-learning-engineer-part-6-the-human-side/
Practical advice for the humans involved with machine learning

In my previous articles, I have spent a lot of time talking about the technical aspects of an image classification problem, from data collection, model evaluation, and performance optimization, to a detailed look at model training.

These elements require a certain degree of in-depth expertise, and they (usually) have well-defined metrics and established processes that are within our control.

Now it’s time to consider…

The human aspects of machine learning

Yes, this may seem like an oxymoron! But it is the interaction with people — the ones you work with and the ones who use your application — that helps bring the technology to life and provides a sense of fulfillment to your work.

These human interactions include:

  • Communicating technical concepts to a non-technical audience.
  • Understanding how your end-users engage with your application.
  • Providing clear expectations on what the model can and cannot do.

I also want to touch on the impact to people’s jobs, both positive and negative, as AI becomes a part of our everyday lives.

Overview

As in my previous articles, I will gear this discussion around an image classification application. With that in mind, these are the groups of people involved with your project:

  • AI/ML Engineer (that’s you) — bringing life to the Machine Learning application.
  • MLOps team — your peers who will deploy, monitor, and enhance your application.
  • Subject matter experts — the ones who will provide the care and feeding of labeled data.
  • Stakeholders — the ones who are looking for a solution to a real world problem.
  • End-users — the ones who will be using your application. These could be internal and external customers.
  • Marketing — the ones who will be promoting usage of your application.
  • Leadership — the ones who are paying the bill and need to see business value.

Let’s dive right in…

AI/ML Engineer

You may be a part of a team or a lone wolf. You may be an individual contributor or a team leader.

Photo by Christina @ wocintechchat.com on Unsplash

Whatever your role, it is important to see the whole picture — not only the coding, the data science, and the technology behind AI/ML — but the value that it brings to your organization.

Understand the business needs

Your company faces many challenges to reduce expenses, improve customer satisfaction, and remain profitable. Position yourself as someone who can create an application that helps achieve their goals.

  • What are the pain points in a business process?
  • What is the value of using your application (time savings, cost savings)?
  • What are the risks of a poor implementation?
  • What is the roadmap for future enhancements and use-cases?
  • What other areas of the business could benefit from the application, and what design choices will help future-proof your work?

Communication

Deep technical discussions with your peers are probably your comfort zone. However, to be a more successful AI/ML Engineer, you should be able to clearly explain the work you are doing to different audiences.

With practice, you can explain these topics in ways that your non-technical business users can follow along with, and understand how your technology will benefit them.

To help you get comfortable with this, try creating a PowerPoint with 2–3 slides that you can cover in 5–10 minutes. For example, explain how a neural network can take an image of a cat or a dog and determine which one it is.

Practice giving this presentation in your mind, to a friend — even your pet dog or cat! This will get you more comfortable with the transitions, tighten up the content, and ensure you cover all the important points as clearly as possible.

  • Be sure to include visuals — pure text is boring, graphics are memorable.
  • Keep an eye on time — respect your audience’s busy schedule and stick to the 5–10 minutes you are given.
  • Put yourself in their shoes — your audience is interested in how the technology will benefit them, not on how smart you are.

Creating a technical presentation is a lot like the Feynman Technique — explaining a complex subject to your audience by breaking it into easily digestible pieces, with the added benefit of helping you understand it more completely yourself.

MLOps team

These are the people that deploy your application, manage data pipelines, and monitor infrastructure that keeps things running.

Without them, your model lives in a Jupyter notebook and helps nobody!

Photo by airfocus on Unsplash

These are your technical peers, so you should be able to connect with their skillset more naturally. You speak in jargon that sounds like a foreign language to most people. Even so, it is extremely helpful for you to create documentation to set expectations around:

  • Process and data flows.
  • Data quality standards.
  • Service level agreements for model performance and availability.
  • Infrastructure requirements for compute and storage.
  • Roles and responsibilities.

It is easy to have a more informal relationship with your MLOps team, but remember that everyone is trying to juggle many projects at the same time.

Email and chat messages are fine for quick-hit issues. But for larger tasks, you will want a system to track things like user stories, enhancement requests, and break-fix issues. This way you can prioritize the work and ensure you don’t forget something. Plus, you can show progress to your supervisor.

Some great tools exist, such as:

  • Jira, GitHub, Azure DevOps Boards, Asana, Monday, etc.

We are all professionals, so having a more formal system to avoid miscommunication and mistrust is good business.

Subject matter experts

These are the team members that have the most experience working with the data that you will be using in your AI/ML project.

Photo by National Cancer Institute on Unsplash

SMEs are very skilled at dealing with messy data — they are human, after all! They can handle one-off situations by considering knowledge outside of their area of expertise. For example, a doctor may recognize metal inserts in a patient’s X-ray that indicate prior surgery. They may also notice a faulty X-ray image due to equipment malfunction or technician error.

However, your machine learning model only knows what it knows, which comes from the data it was trained on. So, those one-off cases may not be appropriate for the model you are training. Your SMEs need to understand that clear, high quality training material is what you are looking for.

Think like a computer

In the case of an image classification application, the output from the model communicates to you how well it was trained on the data set. This comes in the form of error rates, which is very much like when a student takes an exam and you can tell how well they studied by seeing how many questions — and which ones — they get wrong.

In order to reduce error rates, your image data set needs to be objectively “good” training material. To do this, put yourself in an analytical mindset and ask yourself:

  • What images will the computer get the most useful information out of? Make sure all the relevant features are visible.
  • What is it about an image that confused the model? When it makes an error, try to understand why — objectively — by looking at the entire picture.
  • Is this image a “one-off” or a typical example of what the end-users will send? Consider creating a new subclass of exceptions to the norm.

Be sure to communicate to your SMEs that model performance is directly tied to data quality and give them clear guidance:

  • Provide visual examples of what works.
  • Provide counter-examples of what does not work.
  • Ask for a wide variety of data points. In the X-ray example, be sure to get patients with different ages, genders, and races.
  • Provide options to create subclasses of your data for further refinement. Use that X-ray from a patient with prior surgery as a subclass, and eventually, as you get more examples over time, the model can handle them.

This also means that you should become familiar with the data they are working with — perhaps not expert level, but certainly above a novice level.

Lastly, when working with SMEs, be cognizant of the impression they may have that the work you are doing is somehow going to replace their job. It can feel threatening when someone asks you how to do your job, so be mindful.

Ideally, you are building a tool with honest intentions and it will enable your SMEs to augment their day-to-day work. If they can use the tool as a second opinion to validate their conclusions in less time, or perhaps even avoid mistakes, then this is a win for everyone. Ultimately, the goal is to allow them to focus on more challenging situations and achieve better outcomes.

I have more to say on this in my closing remarks.

Stakeholders

These are the people you will have the closest relationship with.

Stakeholders are the ones who created the business case to have you build the machine learning model in the first place.

Photo by Ninthgrid on Unsplash

They have a vested interest in having a model that performs well. Here are some key points when working with your stakeholders:

  • Be sure to listen to their needs and requirements.
  • Anticipate their questions and be prepared to respond.
  • Be on the lookout for opportunities to improve your model performance. Your stakeholders may not be as close to the technical details as you are and may not think there is any room for improvement.
  • Bring issues and problems to their attention. They may not want to hear bad news, but they will appreciate honesty over evasion.
  • Schedule regular updates with usage and performance reports.
  • Explain technical details in terms that are easy to understand.
  • Set expectations on regular training and deployment cycles and timelines.

Your role as an AI/ML Engineer is to bring to life the vision of your stakeholders. Your application is making their lives easier, which justifies and validates the work you are doing. It’s a two-way street, so be sure to share the road.

End-users

These are the people who are using your application. They may also be your harshest critics, but you may never even hear their feedback.

Photo by Alina Ruf on Unsplash

Think like a human

Recall above when I suggested to “think like a computer” when analyzing the data for your training set. Now it’s time to put yourself in the shoes of a non-technical user of your application.

End-users of an image classification model communicate their understanding of what’s expected of them through the images they submit — often poor ones. These are like the students who didn’t study for the exam, or worse, didn’t read the questions, so their answers don’t make sense.

Your model may be really good, but if end-users misuse the application or are not satisfied with the output, you should be asking:

  • Are the instructions confusing or misleading? Did the user focus the camera on the subject being classified, or is it more of a wide-angle image? You can’t blame the user if they follow bad instructions.
  • What are their expectations? When the results are presented to the user, are they satisfied or are they frustrated? You may notice repeated images from frustrated users.
  • Are the usage patterns changing? Are they trying to use the application in unexpected ways? This may be an opportunity to improve the model.

Inform your stakeholders of your observations. There may be simple fixes to improve end-user satisfaction, or there may be more complex work ahead.

If you are lucky, you may discover an unexpected way to leverage the application that leads to expanded usage or exciting benefits to your business.

Explainability

Most AI/ML models are considered “black boxes” that perform millions of calculations on extremely high-dimensional data and produce a rather simplistic result without any explanation behind it.

The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.
— The Hitchhiker’s Guide to the Galaxy

Depending on the situation, your end-users may require more explanation of the results, such as with medical imaging. Where possible, you should consider incorporating model explainability techniques such as LIME, SHAP, and others. These responses can help put a human touch to cold calculations.

Now it’s time to switch gears and consider higher-ups in your organization.

Marketing team

These are the people who promote the use of your hard work. If your end-users are completely unaware of your application, or don’t know where to find it, your efforts will go to waste.

The marketing team controls where users can find your app on your website and link to it through social media channels. They also see the technology through a different lens.

Gartner hype cycle. Image from Wikipedia – https://en.wikipedia.org/wiki/Gartner_hype_cycle

The above hype cycle is a good representation of how technical advancements tend to flow. At the beginning, there can be an unrealistic expectation of what your new AI/ML tool can do — it’s the greatest thing since sliced bread!

Then the “new” wears off and excitement wanes. You may face a lack of interest in your application as the marketing team (as well as your end-users) moves on to the next thing. In reality, the value of your efforts is somewhere in the middle.

Understand that the marketing team’s interest is in promoting the use of the tool because of how it will benefit the organization. They may not need to know the technical inner workings. But they should understand what the tool can do, and be aware of what it cannot do.

Honest and clear communication up-front will help smooth out the hype cycle and keep everyone interested longer. This way the crash from peak expectations to the trough of disillusionment is not so severe that the application is abandoned altogether.

Leadership team

These are the people that authorize spending and have the vision for how the application fits into the overall company strategy. They are driven by factors that you have no control over and you may not even be aware of. Be sure to provide them with the key information about your project so they can make informed decisions.

Photo by Adeolu Eletu on Unsplash

Depending on your role, you may or may not have direct interaction with executive leadership in your company. Your job is to summarize the costs and benefits associated with your project, even if that is just with your immediate supervisor who will pass this along.

Your costs will likely include:

  • Compute and storage — training and serving a model.
  • Image data collection — both real-world and synthetic or staged.
  • Hours per week — SME, MLOps, AI/ML engineering time.

Highlight the savings and/or value added:

  • Provide measures on speed and accuracy.
  • Translate efficiencies into FTE hours saved and customer satisfaction.
  • Bonus points if you can find a way to produce revenue.

Business leaders, much like the marketing team, may follow the hype cycle:

  • Be realistic about model performance. Don’t try to oversell it, but be honest about the opportunities for improvement.
  • Consider creating a human benchmark test to measure accuracy and speed for an SME. It is easy to say human accuracy is 95%, but it’s another thing to measure it.
  • Highlight short-term wins and how they can become long-term success.

Conclusion

I hope you can see that, beyond the technical challenges of creating an AI/ML application, there are many humans involved in a successful project. Being able to interact with these individuals, and meet them where they are in terms of their expectations from the technology, is vital to advancing the adoption of your application.

Photo by Vlad Hilitanu on Unsplash

Key takeaways:

  • Understand how your application fits into the business needs.
  • Practice communicating to a non-technical audience.
  • Collect measures of model performance and report these regularly to your stakeholders.
  • Expect that the hype cycle could help and hurt your cause, and that setting consistent and realistic expectations will ensure steady adoption.
  • Be aware that factors outside of your control, such as budgets and business strategy, could affect your project.

And most importantly…

Don’t let machines have all the fun learning!

Human nature gives us the curiosity we need to understand our world. Take every opportunity to grow and expand your skills, and remember that human interaction is at the heart of machine learning.

Closing remarks

Advancements in AI/ML have the potential (assuming they are properly developed) to do many tasks as well as humans. It would be a stretch to say “better than” humans because it can only be as good as the training data that humans provide. However, it is safe to say AI/ML can be faster than humans.

The next logical question would be, “Well, does that mean we can replace human workers?”

This is a delicate topic, and I want to be clear that I am not an advocate of eliminating jobs.

I see my role as an AI/ML Engineer as one that creates tools that aid in someone else’s job or enhance their ability to complete their work successfully. When used properly, the tools can validate difficult decisions and speed through repetitive tasks, allowing your experts to spend more time on the one-off situations that require more attention.

There may also be new career opportunities, from the care and feeding of data, to quality assessment and user experience, and even new roles that leverage the technology in exciting and unexpected ways.

Unfortunately, business leaders may make decisions that impact people’s jobs, and this is completely out of your control. But all is not lost — even for us AI/ML Engineers…

There are things we can do

  • Be kind to the fellow human beings that we call “coworkers”.
  • Be aware of the fear and uncertainty that comes with technological advancements.
  • Be on the lookout for ways to help people leverage AI/ML in their careers and to make their lives better.

This is all part of being human.

The Basis of Cognitive Complexity: Teaching CNNs to See Connections
https://towardsdatascience.com/the-basis-of-cognitive-complexity-teaching-cnns-to-see-connections/
Transforming CNNs: From task-specific learning to abstract generalization


Liberating education consists in acts of cognition, not transferrals of information.

Paulo Freire

One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing?

Many authors suggest that artificial intelligence models do not possess the same capabilities as humans, especially when it comes to plasticity, flexibility, and adaptation.

One of the aspects that models do not capture is the many causal relationships that govern the external world.

This article discusses these issues:

  • The parallelism between convolutional neural networks (CNNs) and the human visual cortex
  • Limitations of CNNs in understanding causal relations and learning abstract concepts
  • How to make CNNs learn simple causal relations

Is it the same? Is it different?

Convolutional networks (CNNs) [2] are multi-layered neural networks that take images as input and can be used for multiple tasks. One of the most fascinating aspects of CNNs is their inspiration from the human visual cortex [1]:

  • Hierarchical processing. The visual cortex processes images hierarchically: early visual areas capture simple features (such as edges, lines, and colors), while deeper areas capture more complex features such as shapes, objects, and scenes. A CNN, thanks to its layered structure, behaves similarly: early layers capture edges and textures, while deeper layers capture object parts or whole objects.
  • Receptive fields. Neurons in the visual cortex respond to stimuli in a specific local region of the visual field (commonly called receptive fields). As we go deeper, the receptive fields of the neurons widen, allowing more spatial information to be integrated. Thanks to pooling steps, the same happens in CNNs.
  • Feature sharing. Although biological neurons are not identical, similar features are recognized across different parts of the visual field. In CNNs, the various filters scan the entire image, allowing patterns to be recognized regardless of location.
  • Spatial invariance. Humans can recognize objects even when they are moved, scaled, or rotated. CNNs also possess this property.
The relationship between components of the visual system and CNN. Image source: here

These features have made CNNs perform well in visual tasks to the point of superhuman performance:

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well-trained on the validation images to be better aware of the existence of relevant classes. […] Our result (4.94%) exceeds the reported human-level performance. —source [3]

Although CNNs perform better than humans in several tasks, there are still cases where they fail spectacularly. For example, in a 2024 study [4], AI models failed to generalize in image classification: state-of-the-art models perform better than humans on objects in upright poses but fail when objects are in unusual poses.

The correct label is shown above each object, and the AI’s incorrectly predicted label is below. Image source: here

In conclusion, our results show that (1) humans are still much more robust than most networks at recognizing objects in unusual poses, (2) time is of the essence for such ability to emerge, and (3) even time-limited humans are dissimilar to deep neural networks. —source [4]

In the study [4], they note that humans need time to succeed in a task. Some tasks require not only visual recognition but also abstractive cognition, which requires time.

The generalization abilities that make humans capable come from understanding the laws that govern relations among objects. Humans recognize objects by extrapolating rules and chaining these rules to adapt to new situations. One of the simplest rules is the “same-different relation”: the ability to define whether two objects are the same or different. This ability develops rapidly during infancy and is also importantly associated with language development [5-7]. In addition, some animals such as ducks and chimpanzees also have it [8]. In contrast, learning same-different relations is very difficult for neural networks [9-10].

Example of a same-different task for a CNN. The network should return a label of 1 if the two objects are the same or a label of 0 if they are different. Image source: here

Convolutional networks show difficulty in learning this relationship. Likewise, they fail to learn other types of causal relationships that are simple for humans. Therefore, many researchers have concluded that CNNs lack the inductive bias necessary to be able to learn these relationships.

These negative results do not mean that neural networks are completely incapable of learning same-different relations. Much larger and longer trained models can learn this relation. For example, vision-transformer models pre-trained on ImageNet with contrastive learning can show this ability [12].

Can CNNs learn same-different relationships?

The fact that broad models can learn these kinds of relationships has rekindled interest in CNNs. The same-different relationship is considered among the basic logical operations that make up the foundations for higher-order cognition and reasoning. Showing that shallow CNNs can learn this concept would allow us to experiment with other relationships. Moreover, it will allow models to learn increasingly complex causal relationships. This is an important step in advancing the generalization capabilities of AI.

Previous work suggests that CNNs do not have the architectural inductive biases to be able to learn abstract visual relations. Other authors assume that the problem is in the training paradigm. In general, the classical gradient descent is used to learn a single task or a set of tasks. Given a task t or a set of tasks T, a loss function L is used to optimize the weights φ that should minimize the function L:

Image source from here

This can be viewed as simply the sum of the losses across different tasks (if we have more than one task). Instead, the Model-Agnostic Meta-Learning (MAML) algorithm [13] is designed to search for an optimal point in weight space for a set of related tasks. MAML seeks to find an initial set of weights θ that minimizes the loss function across tasks, facilitating rapid adaptation:

Image source from here
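
Written out, the two objectives shown as images above take roughly the following form. This is a reconstruction using the standard MAML formulation with inner-loop step size α, so the notation may differ slightly from the original figures:

% Traditional training: one weight vector minimizes the summed task losses
\phi^{*} = \arg\min_{\phi} \sum_{t \in T} \mathcal{L}_{t}(\phi)

% MAML: optimize the initialization \theta so that a single gradient step
% adapted to each task already achieves a low loss
\theta^{*} = \arg\min_{\theta} \sum_{t \in T}
    \mathcal{L}_{t}\!\left(\theta - \alpha \nabla_{\theta}\,\mathcal{L}_{t}(\theta)\right)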

The difference may seem small, but conceptually, this approach is directed toward abstraction and generalization. If there are multiple tasks, traditional training tries to optimize weights for different tasks. MAML tries to identify a set of weights that is optimal for different tasks but at the same time equidistant in the weight space. This starting point θ allows the model to generalize more effectively across different tasks.

Meta-learning initial weights for generalization. Image source from here

Since we now have a method biased toward generalization and abstraction, we can test whether we can make CNNs learn the same-different relationship.

In this study [11], the authors compared shallow CNNs trained with classic gradient descent and with meta-learning on a dataset designed for this purpose. The dataset consists of 10 different tasks that test for the same-different relationship.

The Same-Different dataset. Image source from here

The authors [11] compare CNNs of 2, 4, or 6 layers trained in a traditional way or with meta-learning, showing several interesting results:

  1. The performance of traditional CNNs shows similar behavior to random guessing.
  2. Meta-learning significantly improves performance, suggesting that the model can learn the same-different relationship. A 2-layer CNN performs little better than chance, but by increasing the depth of the network, performance improves to near-perfect accuracy.
Comparison between traditional training and meta-learning for CNNs. Image source from here

One of the most intriguing results of [11] is that the model can be trained in a leave-one-out way (use 9 tasks and leave one out) and show out-of-distribution generalization capabilities. Thus, the model has learned abstracting behavior that is hardly seen in such a small model (6 layers).

out-of-distribution for same-different classification. Image source from here

Conclusions

Although convolutional networks were inspired by how the human brain processes visual stimuli, they do not capture some of its basic capabilities. This is especially true when it comes to causal relations or abstract concepts. Some of these relationships can be learned only by large models after extensive training. This has led to the assumption that small CNNs cannot learn these relations due to a lack of architectural inductive bias. In recent years, efforts have been made to create new architectures that could have an advantage in learning relational reasoning. Yet most of these architectures fail to learn these kinds of relationships. Intriguingly, this can be overcome through the use of meta-learning.

The advantage of meta-learning is that it incentivizes more abstractive learning. Meta-learning pressures the model toward generalization by trying to optimize for all tasks at the same time. To do this, more abstract features are favored (low-level features, such as the angles of a particular shape, are not useful for generalization and are disfavored). Meta-learning allows a shallow CNN to learn abstract behavior that would otherwise require many more parameters and much more training.

Shallow CNNs learning the same-different relationship serve as a model system for higher cognitive functions. Meta-learning and other forms of training could be useful for improving the reasoning capabilities of models.

Another thing!

You can look for my other articles on Medium, and you can also connect or reach me on LinkedIn or in Bluesky. Check this repository, which contains weekly updated ML & AI news, or here for other tutorials and here for AI reviews. I am open to collaborations and projects, and you can reach me on LinkedIn.

Reference

Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.

  1. Lindsay, 2020, Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future, link
  2. Li, 2020, A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects, link
  3. He, 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, link
  4. Ollikka, 2024, A comparison between humans and AI at recognizing objects in unusual poses, link
  5. Premack, 1981, The codes of man and beasts, link
  6. Blote, 1999, Young children’s organizational strategies on a same–different task: A microgenetic study and a training study, link
  7. Lupker, 2015, Is there phonologically based priming in the same-different task? Evidence from Japanese-English bilinguals, link
  8. Gentner, 2021, Learning same and different relations: cross-species comparisons, link
  9. Kim, 2018, Not-so-clevr: learning same–different relations strains feedforward neural networks, link
  10. Puebla, 2021, Can deep convolutional neural networks support relational reasoning in the same-different task? link
  11. Gupta, 2025, Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation, link
  12. Tartaglini, 2023, Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations, link
  13. Finn, 2017, Model-agnostic meta-learning for fast adaptation of deep networks, link

Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o
https://towardsdatascience.com/deb8flow-orchestrating-autonomous-ai-debates-with-langgraph-and-gpt-4o/
Inside Deb8flow: Real-time AI debates with LangGraph and GPT-4o

Introduction

I’ve always been fascinated by debates—the strategic framing, the sharp retorts, and the carefully timed comebacks. Debates aren’t just entertaining; they’re structured battles of ideas, driven by logic and evidence. Recently, I started wondering: could we replicate that dynamic using AI agents—having them debate each other autonomously, complete with real-time fact-checking and moderation? The result was Deb8flow, an autonomous AI debating environment powered by LangGraph, OpenAI’s GPT-4o model, and the new integrated Web Search feature.

In Deb8flow, two agents—Pro and Con—square off on a given topic while a Moderator manages turn-taking. A dedicated Fact Checker reviews every claim in real time using GPT-4o’s new browsing capabilities, and a final Judge evaluates the arguments for quality and coherence. If an agent repeatedly makes factual errors, they’re automatically disqualified—ensuring the debate stays grounded in truth.

This article offers an in-depth look at the advanced architecture and dynamic workflows that power autonomous AI debates. I’ll walk you through how Deb8flow’s modular design leverages LangGraph’s state management and conditional routing, alongside GPT-4o’s capabilities.

Even if you’re new to AI agents or LangGraph (see resources [1] and [2] for primers), I’ll explain the key concepts clearly. And if you’d like to explore further, the full project is available on GitHub: iason-solomos/Deb8flow.

Ready to see how AI agents can debate autonomously in practice?

Let’s dive in.

High-Level Overview: Autonomous Debates with Multiple Agents

In Deb8flow, we orchestrate a formal debate between two AI agents – one arguing Pro and one Con – complete with a Moderator, a Fact Checker, and a final Judge. The debate unfolds autonomously, with each agent playing a role in a structured format.

At its core, Deb8flow is a LangGraph-powered agent system, built atop LangChain, using GPT-4o to power each role—Pro, Con, Judge, and beyond. We use GPT-4o’s preview model with browsing capabilities to enable real-time fact-checking. In essence, the Pro and Con agents debate; after each statement, a fact-checker agent uses GPT-4o’s web search to catch any hallucinations or inaccuracies in that statement in real time.​ The debate only continues once the statement is verified. The whole process is coordinated by a LangGraph-defined workflow that ensures proper turn-taking and conditional logic.


High-level debate flow graph. Each rectangle is an agent node (Pro/Con debaters, Fact Checker, Judge, etc.), and diamonds are control nodes (Moderator and a router after fact-checking). Solid arrows denote the normal progression, while dashed arrows indicate retries if a claim fails fact-check. The Judge node outputs the final verdict, then the workflow ends.
Image generated by the author with DALL-E

The debate workflow goes through these stages:

  • Topic Generation: A Topic Generator agent produces a nuanced, debatable topic for the session (e.g. “Should AI be used in classroom education?”).
  • Opening: The Pro Argument Agent makes an opening statement in favor of the topic, kicking off the debate.
  • Rebuttal: The Debate Moderator then gives the floor to the Con Argument agent, who rebuts the Pro’s opening statement.
  • Counter: The Moderator gives the floor back to the Pro agent, who counters the Con agent’s points.
  • Closing: The Moderator switches the floor to the Con agent one last time for a closing argument.
  • Judgment: Finally, the Judge agent reviews the full debate history and evaluates both sides based on argument quality, clarity, and persuasiveness. The most convincing side wins.

After every single speech, the Fact Checker agent steps in to verify the factual accuracy of that statement​. If a debater’s claim doesn’t hold up (e.g. cites a wrong statistic or “hallucinates” a fact), the workflow triggers a retry: the speaker has to correct or modify their statement. (If either debater accumulates 3 fact-check failures, they are automatically disqualified for repeatedly spreading inaccuracies, and their opponent wins by default.) This mechanism keeps our AI debaters honest and grounded in reality!

Prerequisites and Setup

Before diving into the code, make sure you have the following in place:

  • Python 3.12+ installed.
  • An OpenAI API key with access to the GPT-4o model. You can create your own API key here: https://platform.openai.com/settings/organization/api-keys
  • Project Code: Clone the Deb8flow repository from GitHub (git clone https://github.com/iason-solomos/Deb8flow.git). The repo includes a requirements.txt for all required packages. Key dependencies include LangChain/LangGraph (for building the agent graph) and the OpenAI Python client.
  • Install Dependencies: In your project directory, run: pip install -r requirements.txt to install the necessary libraries.
  • Create a .env file in the project root to hold your OpenAI API credentials. It should be of the form: OPENAI_API_KEY_GPT4O = "sk-…"
  • You can also at any time check out the README file: https://github.com/iason-solomos/Deb8flow if you simply want to run the finished app.

Once dependencies are installed and the environment variable is set, you should be ready to run the app. The project structure is organized for clarity:

Deb8flow/
├── configurations/
│ ├── debate_constants.py
│ └── llm_config.py
├── nodes/
│ ├── base_component.py
│ ├── topic_generator_node.py
│ ├── pro_debater_node.py
│ ├── con_debater_node.py
│ ├── debate_moderator_node.py
│ ├── fact_checker_node.py
│ ├── fact_check_router_node.py
│ └── judge_node.py
├── prompts/
│ ├── topic_generator_prompts.py
│ ├── pro_debater_prompts.py
│ ├── con_debater_prompts.py
│ └── … (prompts for other agents)
├── tests/ (contains unit and whole workflow tests)
└── debate_workflow.py

A quick tour of this structure:

configurations/ holds constant definitions and LLM configuration classes.

nodes/ contains the implementation of each agent or functional node in the debate (each of these is a module defining one agent’s behavior).

prompts/ stores the prompt templates for the language model (so each agent knows how to prompt GPT-4o for its specific task).

debate_workflow.py ties everything together by defining the LangGraph workflow (the graph of nodes and transitions).

debate_state.py defines the shared data structure that the agents will be using on each run.

tests/ includes some basic tests and example runs to help you verify everything is working.

Under the Hood: State Management and Workflow Setup

To coordinate a complex multi-turn debate, we need a shared state and a well-defined flow. We’ll start by looking at how Deb8flow defines the debate state and constants, and then see how the LangGraph workflow is constructed.

Defining the Debate State Schema (debate_state.py)

Deb8flow uses a shared state (https://langchain-ai.github.io/langgraph/concepts/low_level/#state ) in the form of a Python TypedDict that all agents can read from and update. This state tracks the debate’s progress and context – things like the topic, the history of messages, whose turn it is, etc. By centralizing this information, each agent node can make decisions based on the current state of the debate.

Link: debate_state.py

from typing import TypedDict, List, Dict, Literal


DebateStage = Literal["opening", "rebuttal", "counter", "final_argument"]

class DebateMessage(TypedDict):
    speaker: str  # e.g. pro or con
    content: str  # The message each speaker produced
    validated: bool  # Whether the FactChecker ok’d this message
    stage: DebateStage # The stage of the debate when this message was produced

class DebateState(TypedDict):
    debate_topic: str
    positions: Dict[str, str]
    messages: List[DebateMessage]
    opening_statement_pro_agent: str
    stage: str  # "opening", "rebuttal", "counter", "final_argument"
    speaker: str  # "pro" or "con"
    times_pro_fact_checked: int # The number of times the pro agent has been fact-checked. If it reaches 3, the pro agent is disqualified.
    times_con_fact_checked: int # The number of times the con agent has been fact-checked. If it reaches 3, the con agent is disqualified.

Key fields that we need to have in the DebateState include:

  • debate_topic (str): The topic being debated.
  • messages (List[DebateMessage]): A list of all messages exchanged so far. Each message is a dictionary with fields for speaker (e.g. "pro" or "con" or "fact_checker"), the message content (text), a validated flag (whether it passed fact-check), and the stage of the debate when it was produced.
  • stage (str): The current debate stage (one of "opening", "rebuttal", "counter", "final_argument").
  • speaker (str): Whose turn it is currently ("pro" or "con").
  • times_pro_fact_checked / times_con_fact_checked (int): Counters for how many times each side has been caught with a false claim. (In our rules, if a debater fails fact-check 3 times, they could be disqualified or automatically lose.)
  • positions (Dict[str, str]): (Optional) A mapping of each side’s general stance (e.g., "pro": "In favor of the topic").

By structuring the debate’s state, agents find it easy to access the conversation history or check the current stage, and the control logic can update the state between turns. The state is essentially the memory of the debate.

Constants and Configuration

To avoid “magic strings” scattered in the code, we define some constants in debate_constants.py. For example, constants for stage names (STAGE_OPENING = "opening", etc.), speaker identifiers (SPEAKER_PRO = "pro", SPEAKER_CON = "con", etc.), and node names (NODE_PRO_DEBATER = "pro_debater_node", etc.). These make the code easier to maintain and read.

debate_constants.py:

# Stage names
STAGE_OPENING = "opening"
STAGE_REBUTTAL = "rebuttal"
STAGE_COUNTER = "counter"
STAGE_FINAL_ARGUMENT = "final_argument"
STAGE_END = "end"

# Speakers
SPEAKER_PRO = "pro"
SPEAKER_CON = "con"
SPEAKER_JUDGE = "judge"

# Node names
NODE_PRO_DEBATER = "pro_debater_node"
NODE_CON_DEBATER = "con_debater_node"
NODE_DEBATE_MODERATOR = "debate_moderator_node"
NODE_JUDGE = "judge_node"

We also set up LLM configuration in llm_config.py. Here, we define classes for OpenAI or Azure OpenAI configs and then create a dictionary llm_config_map mapping model names to their config. For instance, we map "gpt-4o" to an OpenAILLMConfig that holds the model name and API key. This way, whenever we need to initialize a GPT-4o agent, we can just do llm_config_map["gpt-4o"] to get the right config. All our main agents (debaters, topic generator, judge) use this same GPT-4o configuration.

import os
from dataclasses import dataclass
from typing import Union

@dataclass
class OpenAILLMConfig:
    """
    A data class to store configuration details for OpenAI models.

    Attributes:
        model_name (str): The name of the OpenAI model to use.
        openai_api_key (str): The API key for authenticating with the OpenAI service.
    """
    model_name: str
    openai_api_key: str


llm_config_map = {
    "gpt-4o": OpenAILLMConfig(
        model_name="gpt-4o",
        openai_api_key=os.getenv("OPENAI_API_KEY_GPT4O"),
    )
}

Building the LangGraph Workflow (debate_workflow.py)

With state and configs in place, we construct the debate workflow graph. LangGraph’s StateGraph is the backbone that connects all our agent nodes in the order they should execute. Here’s how we set it up:

class DebateWorkflow:

    def _initialize_workflow(self) -> StateGraph:
        workflow = StateGraph(DebateState)
        # Nodes
        workflow.add_node("generate_topic_node", GenerateTopicNode(llm_config_map["gpt-4o"]))
        workflow.add_node("pro_debater_node", ProDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("con_debater_node", ConDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("fact_check_node", FactCheckNode())
        workflow.add_node("fact_check_router_node", FactCheckRouterNode())
        workflow.add_node("debate_moderator_node", DebateModeratorNode())
        workflow.add_node("judge_node", JudgeNode(llm_config_map["gpt-4o"]))

        # Entry point
        workflow.set_entry_point("generate_topic_node")

        # Flow
        workflow.add_edge("generate_topic_node", "pro_debater_node")
        workflow.add_edge("pro_debater_node", "fact_check_node")
        workflow.add_edge("con_debater_node", "fact_check_node")
        workflow.add_edge("fact_check_node", "fact_check_router_node")
        workflow.add_edge("judge_node", END)
        return workflow



    async def run(self):
        workflow = self._initialize_workflow()
        graph = workflow.compile()
        # graph.get_graph().draw_mermaid_png(output_file_path="workflow_graph.png")
        initial_state = {
            "topic": "",
            "positions": {}
        }
        final_state = await graph.ainvoke(initial_state, config={"recursion_limit": 50})
        return final_state

Let’s break down what’s happening:

  • We initialize a new StateGraph with our DebateState type as the state schema.
  • We add each node (agent) to the graph with a name. For nodes that need an LLM, we pass in the GPT-4o config. For example, "pro_debater_node" is added as ProDebaterNode(llm_config_map["gpt-4o"]), meaning the Pro debater agent will use GPT-4o as its underlying model.
  • We set the entry point of the graph to "generate_topic_node". This means the first step of the workflow is to generate a debate topic.
  • Then we add directed edges to connect nodes. The edges above encode the primary sequence: topic -> pro’s turn -> fact-check -> (then a routing decision) -> … eventually -> judge -> END. We don’t connect the Moderator or Fact Check Router with static edges, since these nodes use dynamic commands to redirect the flow. The final edge connects the judge to an END marker to terminate the graph.

When the workflow runs, control will pass along these edges in order, but whenever we hit a router or moderator node, that node will output a command telling the graph which node to go to next (overriding the default edge). This is how we create conditional loops: the fact_check_router_node might send us back to a debater node for a retry, instead of following a straight line. LangGraph supports this by allowing nodes to return a special Command object with goto instructions.
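
For completeness, main.py isn't reproduced in this article, but since run() is an async method, a minimal entry point could look like the following sketch (the print line assumes the final message holds the verdict, as described later):

import asyncio

from debate_workflow import DebateWorkflow

if __name__ == "__main__":
    final_state = asyncio.run(DebateWorkflow().run())
    # The last message in the final state holds the verdict (from the judge or the moderator).
    print(final_state["messages"][-1]["content"])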

In summary, at a high level we’ve defined an agentic workflow: a graph of autonomous agents where control can branch and loop based on the agents’ outputs. Now, let’s explore what each of these agent nodes actually does.

Agent Nodes Breakdown

Each stage or role in the debate is encapsulated in a node (agent). In LangGraph, nodes are often simple functions, but I wanted a more object-oriented approach for clarity and reusability. So in Deb8flow, every node is a class with a __call__ method. All the main agent classes inherit from a common BaseComponent for shared functionality. This design makes the system modular: we can easily swap out or extend agents by modifying their class definitions, and each agent class is responsible for its piece of the workflow.

Let’s go through the key agents one by one.

BaseComponent – A Reusable Agent Base Class

Most of our agent nodes (like the debaters and judge) share common needs: they use an LLM to generate output, they might need to retry on errors, and they should track token usage. The BaseComponent class (defined in nodes/base_component.py: https://github.com/iason-solomos/Deb8flow/blob/main/nodes/base_component.py) provides these common features so we don’t repeat code.

class BaseComponent:
    """
    A foundational class for managing LLM-based workflows with token tracking.
    Can handle both Azure OpenAI (AzureChatOpenAI) and OpenAI (ChatOpenAI).
    """

    def __init__(
        self,
        llm_config: Optional[LLMConfig] = None,
        temperature: float = 0.0,
        max_retries: int = 5,
    ):
        """
        Initializes the BaseComponent with optional LLM configuration and temperature.

        Args:
            llm_config (Optional[LLMConfig]): Configuration for either Azure or OpenAI.
            temperature (float): Controls the randomness of LLM outputs. Defaults to 0.0.
            max_retries (int): How many times to retry on 429 errors.
        """
        logger = logging.getLogger(self.__class__.__name__)
        tracer = trace.get_tracer(__name__, tracer_provider=get_tracer_provider())

        self.logger = logger
        self.tracer = tracer
        self.llm: Optional[ChatOpenAI] = None
        self.output_parser: Optional[StrOutputParser] = None
        self.state: Optional[DebateState] = None
        self.prompt_template: Optional[ChatPromptTemplate] = None
        self.chain: Optional[RunnableSequence] = None
        self.documents: Optional[List] = None
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.max_retries = max_retries

        if llm_config is not None:
            self.llm = self._init_llm(llm_config, temperature)
            self.output_parser = StrOutputParser()

    def _init_llm(self, config: LLMConfig, temperature: float):
        """
        Initializes an LLM instance for either Azure OpenAI or standard OpenAI.
        """
        if isinstance(config, AzureOpenAILLMConfig):
            # If it's Azure, use the AzureChatOpenAI class
            return AzureChatOpenAI(
                deployment_name=config.deployment_name,
                azure_endpoint=config.azure_endpoint,
                openai_api_version=config.openai_api_version,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        elif isinstance(config, OpenAILLMConfig):
            # If it's standard OpenAI, use the ChatOpenAI class
            return ChatOpenAI(
                model_name=config.model_name,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        else:
            raise ValueError("Unsupported LLMConfig type.")

    def validate_initialization(self) -> None:
        """
        Ensures we have an LLM and an output parser.
        """
        if not self.llm:
            raise ValueError("LLM is not initialized. Ensure `llm_config` is provided.")
        if not self.output_parser:
            raise ValueError("Output parser is not initialized.")

    def execute_chain(self, inputs: Any) -> Any:
        """
        Executes the LLM chain, tracks token usage, and retries on 429 errors.
        """
        if not self.chain:
            raise ValueError("No chain is initialized for execution.")

        retry_wait = 1  # Initial wait time in seconds

        for attempt in range(self.max_retries):
            try:
                with get_openai_callback() as cb:
                    result = self.chain.invoke(inputs)
                    self.logger.info("Prompt Token usage: %s", cb.prompt_tokens)
                    self.logger.info("Completion Token usage: %s", cb.completion_tokens)
                    self.prompt_tokens = cb.prompt_tokens
                    self.completion_tokens = cb.completion_tokens

                return result

            except Exception as e:
                # If the error mentions 429, do exponential backoff and retry
                if "429" in str(e):
                    self.logger.warning(
                        f"Rate limit reached. Retrying in {retry_wait} seconds... "
                        f"(Attempt {attempt + 1}/{self.max_retries})"
                    )
                    time.sleep(retry_wait)
                    retry_wait *= 2
                else:
                    self.logger.error(f"Unexpected error: {str(e)}")
                    raise e

        raise Exception("API request failed after maximum number of retries")

    def create_chain(
        self, system_template: str, human_template: str
    ) -> RunnableSequence:
        """
        Creates a chain for unstructured outputs.
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm | self.output_parser
        return self.chain

    def create_structured_output_chain(
        self, system_template: str, human_template: str, output_model: Type[BaseModel]
    ) -> RunnableSequence:
        """
        Creates a chain that yields structured outputs (parsed into a Pydantic model).
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm.with_structured_output(output_model)
        return self.chain

    def build_return_with_tokens(self, node_specific_data: dict) -> dict:
        """
        Convenience method to add token usage info into the return values.
        """
        return {
            **node_specific_data,
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
        }

    def __call__(self, state: DebateState) -> None:
        """
        Updates the node's local copy of the state.
        """
        self.state = state
        for key, value in state.items():
            setattr(self, key, value)

Key features of BaseComponent:

  • It stores an LLM client (e.g. an OpenAI ChatOpenAI instance) initialized with a given model and API key, as well as an output parser.
  • It provides a method create_chain(system_template, human_template) which sets up a LangChain prompt chain (a RunnableSequence) combining a system prompt and a human prompt. This chain is what actually generates outputs when run.
  • It has an execute_chain(inputs) method that invokes the chain and includes logic to retry if the OpenAI API returns a rate-limit error (HTTP 429). This is done with exponential backoff up to a max_retries count.
  • It keeps track of token usage (prompt tokens and completion tokens) for logging or analysis.
  • The __call__ method of BaseComponent (which each subclass will call via super().__call__(state)) can perform any setup needed before the node’s main logic runs (like ensuring the LLM is initialized).

By building on BaseComponent, each agent class can focus on its unique logic (like what prompt to use and how to handle the state), while inheriting the heavy lifting of interacting with GPT-4o reliably.

Topic Generator Agent (GenerateTopicNode)

The Topic Generator (topic_generator_node.py) is the first agent in the graph. Its job is to come up with a debatable topic for the session. We give it a prompt that instructs it to output a nuanced topic that could reasonably have a pro and con side.

This agent inherits from BaseComponent and uses a prompt chain (system + human prompt) to generate one item of text – the debate topic. When called, it executes the chain (with no special input, just using the prompt) and gets back a topic_text. It then updates the state with:

  • debate_topic: the generated topic (stripped of any extra whitespace),
  • positions: a dictionary assigning the pro and con stances (by default we use "In favor of the topic" and "Against the topic"),
  • stage: set to "opening",
  • speaker: set to "pro" (so the Pro side will speak first).

In code, the return might look like:

return {
    "debate_topic": debate_topic,
    "positions": positions,
    "stage": "opening",
    "speaker": first_speaker  # "pro"
}

Here are the prompts for the topic generator:

SYSTEM_PROMPT = """\
You are a brainstorming AI that suggests debate topics.
You will provide a single, interesting or timely topic that can have two opposing views.
"""

HUMAN_PROMPT = """\
Please suggest one debate topic for two AI agents to discuss.
For example, it could be about technology, politics, philosophy, or any interesting domain.
Just provide the topic in a concise sentence.
"""

Then we pass these prompts in the constructor of the class itself.

class GenerateTopicNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        # Create the prompt chain.
        self.chain: RunnableSequence = self.create_chain(
            system_template=SYSTEM_PROMPT,
            human_template=HUMAN_PROMPT
        )

    def __call__(self, state: DebateState) -> Dict[str, str]:
        """
        Generates a debate topic and assigns positions to the two debaters.
        """
        super().__call__(state)

        topic_text = self.execute_chain({})

        # Store the topic and assign stances in the DebateState
        debate_topic = topic_text.strip()
        positions = {
            "pro": "In favor of the topic",
            "con": "Against the topic"
        }

        
        first_speaker = "pro"
        self.logger.info("Welcome to our debate panel! Today's debate topic is: %s", debate_topic)
        return {
            "debate_topic": debate_topic,
            "positions": positions,
            "stage": "opening",
            "speaker": first_speaker
        }

It’s a pattern we will repeat for all the agent classes, except for the ones that don’t use an LLM and for the fact checker.

Now we can implement the two stars of the show: the Pro and Con debater agents!

Debater Agents (Pro and Con)

Link: pro_debater_node.py

The two debater agents are very similar in structure, but each uses different prompt templates tailored to their role (pro vs con) and the stage of the debate.

The Pro debater, for example, has to handle an opening statement and a counter-argument (countering the Con’s rebuttal). We also need logic for retries in case a statement fails fact-check. In code, the ProDebater class sets up multiple prompt chains:

  • opening_chain and an opening_retry_chain (using slightly different human prompts – the retry prompt might instruct it to try again without repeating any factually dubious claims).
  • counter_chain and counter_retry_chain for the counter-argument stage.

class ProDebaterNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        self.opening_chain = self.create_chain(SYSTEM_PROMPT, OPENING_HUMAN_PROMPT)
        self.opening_retry_chain = self.create_chain(SYSTEM_PROMPT, OPENING_RETRY_HUMAN_PROMPT)
        self.counter_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_HUMAN_PROMPT)
        self.counter_retry_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_RETRY_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)

        debate_topic = state.get("debate_topic")
        messages = state.get("messages", [])
        stage = state.get("stage")
        speaker = state.get("speaker")

        # Check if retrying (last message was by pro and not validated)
        last_msg = messages[-1] if messages else None
        retrying = last_msg and last_msg["speaker"] == SPEAKER_PRO and not last_msg["validated"]

        if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
            chain = self.opening_retry_chain if retrying else self.opening_chain  # select which chain to trigger: the normal one or the retry chain used after a failed fact check
            result = chain.invoke({
                "debate_topic": debate_topic
            })
        elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
            opponent_msg = self._get_last_message_by(SPEAKER_CON, messages)
            debate_history = get_debate_history(messages)
            chain = self.counter_retry_chain if retrying else self.counter_chain
            result = chain.invoke({
                "debate_topic": debate_topic,
                "opponent_statement": opponent_msg,
                "debate_history": debate_history
            })
        else:
            raise ValueError(f"Unknown turn for ProDebater: stage={stage}, speaker={speaker}")
        new_message = create_debate_message(speaker=SPEAKER_PRO, content=result, stage=stage)
        self.logger.info("Speaker: %s, Stage: %s, Retry: %s\nMessage:\n%s", speaker, stage, retrying, result)
        return {
            "messages": messages + [new_message]
        }

    def _get_last_message_by(self, speaker_prefix, messages):
        for m in reversed(messages):
            if m.get("speaker") == speaker_prefix:
                return m["content"]
        return ""

When the ProDebater’s __call__ runs, it looks at the current stage and speaker in the state to decide what to do:

  • If it’s the opening stage and the speaker is “pro”, it uses the opening_chain to generate an opening argument. If the last message from Pro was marked invalid (not validated), it knows this is a retry, so it would use the opening_retry_chain instead.
  • If it’s the counter stage and speaker is “pro”, it generates a counter-argument to whatever the opponent (Con) just said. It will fetch the last message by the Con from the messages history, and feed that into the prompt (so that the Pro can directly counter it). Again, if the last Pro message was invalid, it would switch to the retry chain.

After generating its argument, the Debater agent creates a new message entry (with speaker="pro", the content text, validated=False initially, and the stage) and appends it to the state’s message list. That becomes the output of the node (LangGraph will merge this partial state update into the global state).
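
The create_debate_message helper isn’t shown in the snippets here; a minimal version consistent with the message fields just described might look like this:

def create_debate_message(speaker: str, content: str, stage: str) -> dict:
    """Build one entry for the shared messages list (illustrative sketch)."""
    return {
        "speaker": speaker,    # "pro", "con", "fact_checker", or "moderator"
        "content": content,
        "stage": stage,        # e.g. "opening", "rebuttal", "counter", "final_argument"
        "validated": False,    # the Fact Checker later flips this to True if the claims hold
    }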

The Con Debater agent mirrors this logic for its stages:

  • It has a rebuttal stage and a closing (final argument) stage, each with a normal and a retry chain.
  • It checks whether it’s the rebuttal or the final argument stage (with speaker “con”) and invokes the appropriate chain, using the last Pro message for context when rebutting.
  • It appends its message to the state, just like the Pro debater.

Link: con_debater_node.py

By using class-based implementation, our debaters’ code is easier to maintain. We can clearly separate what the Pro does vs what the Con does, even if they share structure. Also, by encapsulating prompt chains inside the class, each debater can manage multiple possible outputs (regular vs retry) cleanly.

Prompt design: The actual prompts (in prompts/pro_debater_prompts.py and con_debater_prompts.py) guide the GPT-4o model to take on a persona (“You are a debater arguing for/against the topic…”) and produce the argument. They also instruct the model to keep statements factual and logical. If a fact check fails, the retry prompt may say something like: “Your previous statement had an unverified claim. Revise your argument to be factually correct while maintaining your position.” – encouraging the model to correct itself.

With this, our AI debaters can engage in a multi-turn duel, and even recover from factual missteps.

Fact Checker Agent (FactCheckNode)

After each debater speaks, the Fact Checker agent swoops in to verify their claims. This agent is implemented in fact_checker_node.py (https://github.com/iason-solomos/Deb8flow/blob/main/nodes/fact_checker_node.py), and interestingly, it uses the GPT-4o model’s browsing ability rather than our own custom prompts. Essentially, we delegate the fact-checking to OpenAI’s GPT-4o with web search.

How does this work? The OpenAI Python client (here with the search-enabled gpt-4o-search-preview model) allows us to send a user message and get a structured response. In FactCheckNode.__call__, we do something like:

completion = self.client.beta.chat.completions.parse(
    model="gpt-4o-search-preview",
    web_search_options={},
    messages=[{
        "role": "user",
        "content": (
            f"Consider the following statement from a debate. "
            f"If the statement contains numbers, or figures from studies, fact-check it online.\n\n"
            f"Statement:\n\"{claim}\"\n\n"
            f"Reply clearly whether any numbers or studies might be inaccurate or hallucinated, and why."
            f"\n"
            f"If the statement doesn't contain references to studies or numbers cited, don't go online to fact-check, and just consider it successfully fact-checked, with a 'yes' score.\n\n"
        )
    }],
    response_format=FactCheck
)

If the result is “yes” (meaning the claim seems truthful or at least not factually wrong), the Fact Checker will mark the last message’s validated field as True in the state, and output {"validated": True} with no further changes. This signals that the debate can continue normally.

If the result is “no” (meaning it found the claim to be incorrect or dubious), the Fact Checker will append a new message to the state with speaker="fact_checker" describing the finding (or we could simply mark it, but providing a brief note like “(Fact Checker: The statistic cited could not be verified.)” can be useful). It will also set validated: False and increment a counter for whichever side made the claim. The output state from this node includes validated: False and an updated times_pro_fact_checked or times_con_fact_checked count.

We also use a Pydantic BaseModel to control the output of the LLM:

class FactCheck(BaseModel):
    """
    Pydantic model for the fact checking the claims made by debaters.

    Attributes:
        binary_score (str): 'yes' if the claim is verifiable and truthful, 'no' otherwise.
    """

    binary_score: str = Field(
        description="Indicates if the claim is verifiable and truthful. 'yes' or 'no'."
    )
    justification: str = Field(
        description="Explanation of the reasoning behind the score."
    )
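
The full node code lives in the repo; a rough sketch of how the parsed FactCheck result could translate into the state update described above (inside __call__, with state and messages in scope, and omitting the optional fact-checker note) is:

fact_check: FactCheck = completion.choices[0].message.parsed

last_message = messages[-1]
if fact_check.binary_score.lower() == "yes":
    last_message["validated"] = True
    return {"messages": messages, "validated": True}
else:
    last_message["validated"] = False
    counter_key = (
        "times_pro_fact_checked"
        if last_message["speaker"] == SPEAKER_PRO
        else "times_con_fact_checked"
    )
    return {
        "messages": messages,
        "validated": False,
        counter_key: state.get(counter_key, 0) + 1,
    }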

Debate Moderator Agent (DebateModeratorNode)

The Debate Moderator is the conductor of the debate. Instead of producing lengthy text, this agent’s job is to manage turn-taking and stage progression. In the workflow, after a statement is validated by the Fact Checker, control passes to the Moderator node. The Moderator then issues a Command that updates the state for the next turn and directs the flow to the appropriate next agent.

The logic in DebateModeratorNode.__call__ (see nodes/debate_moderator_node.py: https://github.com/iason-solomos/Deb8flow/blob/main/nodes/debate_moderator_node.py) goes roughly like this:

if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_REBUTTAL, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_REBUTTAL and speaker == SPEAKER_CON:
    return Command(
        update={"stage": STAGE_COUNTER, "speaker": SPEAKER_PRO},
        goto=NODE_PRO_DEBATER
    )
elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_FINAL_ARGUMENT, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_FINAL_ARGUMENT and speaker == SPEAKER_CON:
    return Command(
        update={},
        goto=NODE_JUDGE
    )

raise ValueError(f"Unexpected stage/speaker combo: stage={stage}, speaker={speaker}")

Each conditional corresponds to a point in the debate where a turn just ended, and sets up the next turn. For example, after the opening (Pro just spoke), it sets stage to rebuttal, switches speaker to Con, and directs the workflow to the Con debater node. After the final_argument (Con’s closing), it directs to the Judge with no further update (the debate stage effectively ends).

Fact Check Router (FactCheckRouterNode)

This is another control node (like the Moderator) that introduces conditional logic. The Fact Check Router sits right after the Fact Checker agent in the flow. Its purpose is to branch the workflow depending on the fact-check result.

In nodes/fact_check_router_node.py (https://github.com/iason-solomos/Deb8flow/blob/main/nodes/fact_check_router_node.py), the logic is:

if pro_fact_checks >= 3 or con_fact_checks >= 3:
    disqualified = SPEAKER_PRO if pro_fact_checks >= 3 else SPEAKER_CON
    winner = SPEAKER_CON if disqualified == SPEAKER_PRO else SPEAKER_PRO

    verdict_msg = {
        "speaker": "moderator",
        "content": (
            f"Debate ended early due to excessive factual inaccuracies.\n\n"
            f"DISQUALIFIED: {disqualified.upper()} (exceeded fact check limit)\n"
            f"WINNER: {winner.upper()}"
        ),
        "validated": True,
        "stage": "verdict"
    }
    return Command(
        update={"messages": messages + [verdict_msg]},
        goto=END
    )
if last_message.get("validated"):
    return Command(goto=NODE_DEBATE_MODERATOR)
elif speaker == SPEAKER_PRO:
    return Command(goto=NODE_PRO_DEBATER)
elif speaker == SPEAKER_CON:
    return Command(goto=NODE_CON_DEBATER)
raise ValueError("Unable to determine routing in FactCheckRouterNode.")

First, the Fact Check Router checks if either side’s fact-check count has reached 3. If so, it creates a Moderator-style message announcing an early end: the offending side is disqualified and the other side is the winner. It appends this verdict to the messages and returns a Command that jumps to END, effectively terminating the debate without going to the Judge (because we already know the outcome).

If we’re not ending the debate early, it then looks at the Fact Checker’s result for the last message (which is stored as validated on that message). If validated is True, we go to the debate moderator: Command(goto=debate_moderator_node).

Else if the statement fails fact-check, the workflow goes back to the debater to produce a revised statement (with the state counters updated to reflect the failure). This loop can happen multiple times if needed (up to the disqualification limit).

This dynamic control is the heart of Deb8flow’s “agentic” nature – the ability to adapt the path of execution based on the content of the agents’ outputs. It showcases LangGraph’s strength: combining control flow with state. We’re essentially encoding debate rules (like allowing retries for false claims, or ending the debate if someone cheats too often) directly into the workflow graph.

Judge Agent (JudgeNode)

Last but not least, the Judge agent delivers the final verdict based on rhetorical skill, clarity, structure, and overall persuasiveness. Its system prompt and human prompt make this explicit:

  • System Prompt: “You are an impartial debate judge AI. … Evaluate which debater presented their case more clearly, persuasively, and logically. You must focus on communication skills, structure of argument, rhetorical strength, and overall coherence.”
  • Human Prompt: “Here is the full debate transcript. Please analyze the performance of both debaters—PRO and CON. Evaluate rhetorical performance—clarity, structure, persuasion, and relevance—and decide who presented their case more effectively.”

When the Judge node runs, it receives the entire debate transcript (all validated messages) alongside the original topic. It then uses GPT-4o to examine how each side framed their arguments, handled counterpoints, and supported (or failed to support) claims with examples or logic. Crucially, the Judge is forbidden to evaluate which position is objectively correct (or who it thinks might be correct)—only who argued more persuasively.

Below is an example final verdict from a Deb8flow run on the topic:
“Should governments implement a universal basic income in response to increasing automation in the workforce?”

WINNER: PRO

REASON: The PRO debater presented a more compelling and rhetorically effective case for universal basic income. Their arguments were well-structured, beginning with a clear statement of the issue and the necessity of UBI in response to automation. They effectively addressed potential counterarguments by highlighting the unprecedented speed and scope of current technological changes, which distinguishes the current situation from past technological shifts. The PRO also provided empirical evidence from UBI pilot programs to counter the CON's claims about work disincentives and economic inefficiencies, reinforcing their argument with real-world examples.

In contrast, the CON debater, while presenting valid concerns about UBI, relied heavily on historical analogies and assumptions about workforce adaptability without adequately addressing the unique challenges posed by modern automation. Their arguments about the fiscal burden and potential inefficiencies of UBI were less supported by specific evidence compared to the PRO's rebuttals.

Overall, the PRO's arguments were more coherent, persuasive, and backed by empirical evidence, making their case more convincing to a neutral observer.

Langsmith Tracing

Throughout Deb8flow’s development, I relied on LangSmith (LangChain’s tracing and observability toolkit) to ensure the entire debate pipeline was behaving correctly. Because we have multiple agents passing control between themselves, it’s easy for unexpected loops or misrouted states to occur. LangSmith provides a convenient way to:

  • Visualize Execution Flow: You can see each agent’s prompt, the tokens consumed (so you can also track costs), and any intermediate states. This makes it much simpler to confirm that, say, the Con Debater is properly referencing the Pro Debater’s last message, or that the Fact Checker is accurately receiving the claim to verify.
  • Debug State Updates: If the Moderator or Fact Check Router is sending the flow to the wrong node, the trace will highlight that mismatch. You can trace which agent was invoked at each step and why, helping you spot stage or speaker misalignments early.
  • Track Prompt and Completion Tokens: With multiple GPT-4o calls, it’s useful to see how many tokens each stage is using, which LangSmith logs automatically if you enable tracing.

Integrating LangSmith is surprisingly easy. You just need to provide these three keys in your .env file:

  • LANGCHAIN_API_KEY
  • LANGCHAIN_TRACING_V2
  • LANGCHAIN_PROJECT

Then you can open the LangSmith UI to see a structured trace of each run. This greatly reduces the guesswork involved in debugging multi-agent systems and is, in my experience, essential for more complex AI orchestration like ours. Example of a single run:

The trace in waterfall mode in LangSmith for one run, showing how the whole flow executed. Source: Generated by the author using LangSmith.

Reflections and Next Steps

Building Deb8flow was an eye-opening exercise in orchestrating autonomous agent workflows. We didn’t just chain a single model call – we created an entire debate simulation with AI agents, each with a specific role, and allowed them to interact according to a set of rules. LangGraph provided a clear framework to define how data and control flows between agents, making the complex sequence manageable in code. By using class-based agents and a shared state, we maintained modularity and clarity, which will pay off for any software engineering project in the long run.

An exciting aspect of this project was seeing emergent behavior. Even though each agent follows a script (a prompt), the unscripted combination – a debater trying to deceive, a fact-checker catching it, the debater rephrasing – felt surprisingly realistic! It’s a small step toward more agentic AI systems that can perform non-trivial multi-step tasks while providing oversight over each other.

There are plenty of ideas for improvement:

  • User Interaction: Currently it’s fully autonomous, but one could add a mode where a human provides the topic or even takes the role of one side against an AI opponent.
  • We can switch the order in which the Debaters talk.
  • We can experiment with different prompts, and thus change the behavior of the agents to a good degree.
  • Make the debaters also perform web search before producing their statements, thus providing them with the latest information.

The broader implication of Deb8flow is how it showcases a pattern for composable AI agents. By defining clear boundaries and interactions (just like microservices in software), we can have complex AI-driven processes that remain interpretable and controllable. Each agent is like a cog in a machine, and LangGraph is the gear system making them work in unison.

I found this project energizing, and I hope it inspires you to explore multi-agent workflows. Whether it’s debating, collaborating on writing, or solving problems from different expert angles, the combination of GPT, tools, and structured agentic workflows opens up a new world of possibilities for AI development. Happy hacking!

References

[1] D. Bouchard, “From Basics to Advanced: Exploring LangGraph,” Medium, Nov. 22, 2023. [Online]. Available: https://medium.com/data-science/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787. [Accessed: Apr. 1, 2025].

[2] A. W. T. Ng, “Building a Research Agent that Can Write to Google Docs: Part 1,” Towards Data Science, Jan. 11, 2024. [Online]. Available: https://towardsdatascience.com/building-a-research-agent-that-can-write-to-google-docs-part-1-4b49ea05a292/. [Accessed: Apr. 1, 2025].

The post Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o appeared first on Towards Data Science.

]]>
Why CatBoost Works So Well: The Engineering Behind the Magic https://towardsdatascience.com/catboost-inner-workings-and-optimizations/ Thu, 10 Apr 2025 00:28:11 +0000 https://towardsdatascience.com/?p=605702 CatBoost stands out by directly tackling a long-standing challenge in gradient boosting—how to handle categorical variables effectively without causing target leakage. By introducing innovative techniques such as Ordered Target Statistics and Ordered Boosting, and by leveraging the structure of Oblivious Trees, CatBoost efficiently balances robustness and accuracy. These methods ensure that each prediction uses only past data, preventing leakage and resulting in a model that is both fast and reliable for real-world tasks.

The post Why CatBoost Works So Well: The Engineering Behind the Magic appeared first on Towards Data Science.

]]>

Gradient boosting is a cornerstone technique for modeling tabular data due to its speed and simplicity. It delivers great results without any fuss. When you look around, you’ll see multiple options like LightGBM, XGBoost, etc. CatBoost is one such variant. In this post, we will take a detailed look at this model, explore its inner workings, and understand what makes it a great choice for real-world tasks.

Target Statistic

Table illustrating target encoding for categorical values. It maps vehicle types—Car, Bike, Bus, and Cycle—to numerical target means: 3.9, 1.2, 11.7, and 0.8 respectively. A curved arrow at the bottom indicates the transformation from category to numeric value
Target Encoding Example: the average value of the target variable for a category is used to replace each category. Image by author

One of the important contributions of the CatBoost paper is a new method of calculating the Target Statistic. What is a Target Statistic? If you have worked with categorical variables before, you’d know that the most rudimentary way to deal with categorical variables is to use one-hot encoding. From experience, you’d also know that this introduces a host of problems like sparsity, the curse of dimensionality, and memory issues, especially for categorical variables with high cardinality.

Greedy Target Statistic

To avoid one-hot encoding, we calculate the Target Statistic instead for the categorical variables. This means we calculate the mean of the target variable at each unique value of the categorical variable. So if a categorical variable takes the values A, B, and C, we replace each category with the average of \(\text{y}\) computed over the rows where that category appears.

That sounds good, right? It does but this approach comes with its problems — namely Target Leakage. To understand this, let’s take an extreme example. Extreme examples are often the easiest way to eke out issues in the approach. Consider the below dataset:

Categorical Column | Target Column
A | 0
B | 1
C | 0
D | 1
E | 0
Greedy Target Statistic: Compute the mean target value for each unique category


Now let’s write the equation for calculating the Target Statistic:
\[\hat{x}^i_k = \frac{
\sum_{j=1}^{n} 1_{{x^i_j = x^i_k}} \cdot y_j + a p
}{
\sum_{j=1}^{n} 1_{{x^i_j = x^i_k}} + a
}\]

Here \(x^i_j\) is the value of the i-th categorical feature for the j-th sample. So for the k-th sample, we iterate over all samples of \(x^i\), select the ones having the value \(x^i_k\), and take the average value of \(y\) over those samples. Instead of taking a direct average, we take a smoothened average which is what the \(a\) and \(p\) terms are for. The \(a\) parameter is the smoothening parameter and \(p\) is the global mean of \(y\).

If we calculate the Target Statistic using the formula above, we get:

Categorical Column | Target Column | Target Statistic
A | 0 | \(\frac{ap}{1+a}\)
B | 1 | \(\frac{1+ap}{1+a}\)
C | 0 | \(\frac{ap}{1+a}\)
D | 1 | \(\frac{1+ap}{1+a}\)
E | 0 | \(\frac{ap}{1+a}\)
Calculation of Greedy Target Statistic with Smoothening


Now if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{0.5+ap}{1+a}\). Anything above this value will be classified as 1 and anything below will be classified as 0. I have a perfect classification at this point, so I get 100% accuracy on my training data.

Let’s take a look at the test data. Here, since we are assuming that the feature has all unique values, the Target Statistic becomes—
\[TS = \frac{0+ap}{0+a} = p\]
If \(threshold\) is greater than \(p\), all test data predictions will be \(0\). Conversely, if \(threshold\) is less than \(p\), all test data predictions will be \(1\) leading to poor performance on the test set.

Although we rarely see datasets where values of a categorical variable are all unique, we do see cases of high cardinality. This extreme example shows the pitfalls of using Greedy Target Statistic as an encoding approach.
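
To make the formula concrete, here is a small illustrative implementation of the greedy (smoothed) Target Statistic in pandas. This is not CatBoost’s internal code, and note that it uses the full dataset for every row, which is exactly the source of the leakage discussed above:

import pandas as pd

def greedy_target_statistic(df: pd.DataFrame, cat_col: str, target_col: str, a: float = 1.0) -> pd.Series:
    p = df[target_col].mean()  # global prior p
    stats = df.groupby(cat_col)[target_col].agg(["sum", "count"])
    smoothed = (stats["sum"] + a * p) / (stats["count"] + a)  # (sum of y + a*p) / (count + a)
    return df[cat_col].map(smoothed)

df = pd.DataFrame({"cat": list("ABCDE"), "y": [0, 1, 0, 1, 0]})
df["greedy_ts"] = greedy_target_statistic(df, "cat", "y")
print(df)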

Leave One Out Target Statistic

So the Greedy TS didn’t work out quite well for us. Let’s try another method— the Leave One Out Target Statistic method. At first glance, this looks promising. But, as it turns out, this too has its problems. Let’s see how with another extreme example. This time let’s assume that our categorical variable \(x^i\) has only one unique value, i.e., all values are the same. Consider the below data:

Categorical Column | Target Column
A | 0
A | 1
A | 0
A | 1
Example data for an extreme case where a categorical feature has just one unique value


If we calculate the Leave One Out Target Statistic, we get:

Categorical Column | Target Column | Target Statistic
A | 0 | \(\frac{n^+ - y_k + ap}{n+a}\)
A | 1 | \(\frac{n^+ - y_k + ap}{n+a}\)
A | 0 | \(\frac{n^+ - y_k + ap}{n+a}\)
A | 1 | \(\frac{n^+ - y_k + ap}{n+a}\)
Calculation of Leave One Out Target Statistic with Smoothening


Here:

  • \(n\) is the total number of samples in the data (in our case, 4)
  • \(n^+\) is the number of positive samples in the data (in our case, 2)
  • \(y_k\) is the value of the target column in that row

Substituting the above, we get:

Categorical Column | Target Column | Target Statistic
A | 0 | \(\frac{2 + ap}{4+a}\)
A | 1 | \(\frac{1 + ap}{4+a}\)
A | 0 | \(\frac{2 + ap}{4+a}\)
A | 1 | \(\frac{1 + ap}{4+a}\)
Substituting the values of \(n\) and \(n^+\)


Now, if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{1.5+ap}{4+a}\). Anything above this value will be classified as 0 and anything below will be classified as 1. I have a perfect classification at this point, so I again get 100% accuracy on my training data.

You see the problem, right? My categorical variable, which has only one unique value, is producing different values of the Target Statistic, which will perform great on the training data but fail miserably on the test data.

Ordered Target Statistic

Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples. Image by author

CatBoost introduces a technique called Ordered Target Statistic to address the issues discussed above. This is the core principle of CatBoost’s handling of categorical variables.

This method, inspired by online learning, uses only past data to make predictions. CatBoost generates a random permutation (random ordering) of the training data, \(\sigma\). To compute the Target Statistic for a sample at row \(k\), CatBoost uses samples from row \(1\) to \(k-1\). For the test data, it uses the entire training data to compute the statistic.

Additionally, CatBoost generates a new permutation for each tree, rather than reusing the same permutation each time. This reduces the variance that can arise in the early samples.
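
Here is an illustrative sketch of the ordered Target Statistic (again, not CatBoost’s actual implementation): each row’s statistic is computed only from the rows that precede it in a random permutation, so no row ever sees its own label:

import numpy as np
import pandas as pd

def ordered_target_statistic(x: pd.Series, y: pd.Series, a: float = 1.0, seed: int = 0) -> pd.Series:
    rng = np.random.default_rng(seed)
    p = y.mean()                              # prior term
    sigma = rng.permutation(len(x))           # random ordering of the rows
    sums, counts = {}, {}
    ts = np.empty(len(x))
    for idx in sigma:
        cat = x.iloc[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        ts[idx] = (s + a * p) / (c + a)       # uses only rows seen earlier in the permutation
        sums[cat] = s + y.iloc[idx]           # the current row enters the statistics only afterwards
        counts[cat] = c + 1
    return pd.Series(ts, index=x.index)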

Ordered Boosting

Ordered boosting in CatBoost: for sample xᵢ, the residual rᵗ(xᵢ, yᵢ) = yᵢ − Mᵢ₋₁⁽ᵗ⁻¹⁾(xᵢ) is computed with a model trained only on the samples that precede xᵢ in the permutation, and the model is then updated with a ΔM learned from those earlier samples. Source

Another important innovation introduced by the CatBoost paper is its use of Ordered Boosting. It builds on similar principles as ordered target statistics, where CatBoost randomly permutes the training data at the start of each tree and makes predictions sequentially.

In traditional boosting methods, when training tree \(t\), the model uses predictions from the previous tree \(t−1\) for all training samples, including the one it is currently predicting. This can lead to target leakage, as the model may indirectly use the label of the current sample during training.

To address this issue, CatBoost uses Ordered Boosting where, for a given sample, it only uses predictions from previous rows in the training data to calculate gradients and build trees. For each row \(i\) in the permutation, CatBoost calculates the output value of a leaf using only the samples before \(i\). The model uses this value to get the prediction for row \(i\). Thus, the model predicts each row without looking at its label.

CatBoost trains each tree using a new random permutation to average out the variance that arises from the early samples of any single permutation.
Let’s say we have 5 data points: A, B, C, D, E. CatBoost creates a random permutation of these points. Suppose the permutation is: σ = [C, A, E, B, D]

Step | Data Used to Train | Data Point Being Predicted | Notes
1 | (none) | C | No previous data → use prior
2 | C | A | Model trained on C only
3 | C, A | E | Model trained on C, A
4 | C, A, E | B | Model trained on C, A, E
5 | C, A, E, B | D | Model trained on C, A, E, B
Table highlighting how CatBoost uses a random permutation to perform training

This avoids using the actual label of the current row to get the prediction, thus preventing leakage.
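
A heavily simplified sketch of this idea is shown below. Real CatBoost does not refit a model per row (it maintains incremental supporting models instead), and fit_model here stands in for any function that fits and returns a small regressor; the sketch only captures the rule that each residual depends solely on earlier rows in the permutation:

import numpy as np

def ordered_residuals(X, y, fit_model, seed=0):
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(len(y))            # e.g. [C, A, E, B, D] in the table above
    residuals = np.zeros(len(y))
    for step, idx in enumerate(sigma):
        prefix = sigma[:step]                  # only the rows that came earlier
        if len(prefix) == 0:
            pred = 0.0                         # no history yet: fall back to the initial prediction
        else:
            pred = fit_model(X[prefix], y[prefix]).predict(X[idx:idx + 1])[0]
        residuals[idx] = y[idx] - pred         # the row's own label never influences its prediction
    return residuals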

Building a Tree

Each time CatBoost builds a tree, it creates a random permutation of the training data. It calculates the ordered target statistic for all the categorical variables with more than two unique values. For a binary categorical variable, it maps the values to zeros and ones.

CatBoost processes data as if the data is arriving sequentially. It begins with an initial prediction of zero for all instances, meaning the residuals are initially equivalent to the target values.

As training proceeds, CatBoost updates the leaf output for each sample using the residuals of the previous samples that fall into the same leaf. By not using the current sample’s label for prediction, CatBoost effectively prevents data leakage.

Split Candidates

Histogram showing how continuous features can be divided into bins—CatBoost evaluates splits using these binned values instead of raw continuous values
CatBoost bins continuous features to reduce the search space for optimal splits. Each bin edge and split point represents a potential decision threshold. Image by author

At the core of a decision tree lies the task of selecting the optimal feature and threshold for splitting a node. This involves evaluating multiple feature-threshold combinations and selecting the one that gives the best reduction in loss. CatBoost does something similar. It discretizes continuous variables into bins to simplify the search for the optimal combination, and it evaluates each of these feature-bin combinations to determine the best split.
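
As a toy illustration of the binning step (not CatBoost’s exact quantization scheme), a continuous feature can be reduced to a handful of candidate thresholds like this:

import numpy as np

values = np.random.default_rng(0).normal(size=1_000)        # a continuous feature
edges = np.quantile(values, np.linspace(0, 1, 33)[1:-1])    # 31 inner edges -> 32 bins
bins = np.digitize(values, edges)                           # bin index per sample
# Split candidates are now "feature <= edge" for each of the 31 edges,
# instead of one candidate per distinct raw value.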

CatBoost uses Oblivious Trees, a key difference compared to other trees, where it uses the same split across all nodes at the same depth.

Oblivious Trees

Comparison between Oblivious Trees and Regular Trees. The Oblivious Tree on the left applies the same split condition at each level across all nodes, resulting in a symmetric structure. The Regular Tree on the right applies different conditions at each node, leading to an asymmetric structure with varied splits at different depths
Comparison of an Oblivious Tree (same split condition at every node of a given depth) with a regular decision tree (different conditions per node). Image by author

Unlike standard decision trees, where different nodes can split on different conditions (feature-threshold), Oblivious Trees apply the same split condition to all nodes at the same depth of the tree. At a given depth, all samples are evaluated at the same feature-threshold combination. This symmetry has several implications:

  • Speed and simplicity: since the same condition is applied across all nodes at the same depth, the trees produced are simpler and faster to train
  • Regularization: since all trees are forced to apply the same condition at the same depth, there is a regularization effect on the predictions
  • Parallelization: the uniformity of the split condition makes it easier to parallelize tree creation and to use GPUs to accelerate training
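
To make the symmetry concrete, here is a minimal sketch of prediction with an oblivious tree: the tree is fully described by one (feature, threshold) pair per depth level, and a sample’s leaf index is simply the binary code formed by its answers:

import numpy as np

def oblivious_tree_predict(X, splits, leaf_values):
    leaf_idx = np.zeros(len(X), dtype=int)
    for feature, threshold in splits:                  # one condition per depth level
        leaf_idx = (leaf_idx << 1) | (X[:, feature] > threshold).astype(int)
    return leaf_values[leaf_idx]

# A depth-2 tree is fully described by 2 conditions and 2**2 leaf values.
X = np.array([[0.2, 5.0], [0.9, 1.0], [0.4, 3.0]])
splits = [(0, 0.5), (1, 2.0)]                          # same condition reused across each level
leaf_values = np.array([0.1, 0.4, 0.7, 1.0])
print(oblivious_tree_predict(X, splits, leaf_values))  # leaf index = binary code of the answers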

Conclusion

CatBoost stands out by directly tackling a long-standing challenge: how to handle categorical variables effectively without causing target leakage. Through innovations like Ordered Target Statistics, Ordered Boosting, and the use of Oblivious Trees, it efficiently balances robustness and accuracy.

If you found this deep dive helpful, you might enjoy another deep dive on the differences between Stochastic Gradient Classifier and Logistic Regression

The post Why CatBoost Works So Well: The Engineering Behind the Magic appeared first on Towards Data Science.

]]>
Circuit Tracing: A Step Closer to Understanding Large Language Models https://towardsdatascience.com/circuit-tracing-a-step-closer-to-understanding-large-language-models/ Tue, 08 Apr 2025 18:38:39 +0000 https://towardsdatascience.com/?p=605686 Reverse-engineering large languages models' computation circuit to understand their decision-making processes

The post Circuit Tracing: A Step Closer to Understanding Large Language Models appeared first on Towards Data Science.

]]>
Context

Over the years, Transformer-based large language models (LLMs) have made substantial progress across a wide range of tasks, evolving from simple information retrieval systems to sophisticated agents capable of coding, writing, conducting research, and much more. But despite their capabilities, these models are still largely black boxes. Given an input, they accomplish the task, but we lack intuitive ways to understand how the task was actually accomplished.

LLMs are designed to predict the statistically best next word/token. But do they only focus on predicting the next token, or do they plan ahead? For instance, when we ask a model to write a poem, is it generating one word at a time, or is it anticipating rhyme patterns before outputting the word? Or, when asked a basic reasoning question, like what is the capital of the state where Dallas is located, they often produce results that look like a chain of reasoning, but did the model actually use that reasoning? We lack visibility into the model’s internal thought process. To understand LLMs, we need to trace their underlying logic.

The study of LLMs’ internal computation falls under “Mechanistic Interpretability,” which aims to uncover the computational circuit of models. Anthropic is one of the leading AI companies working on interpretability. In March 2025, they published a paper titled “Circuit Tracing: Revealing Computational Graphs in Language Models,” which aims to tackle the problem of circuit tracing.

This post aims to explain the core ideas behind their work and build a foundation for understanding circuit tracing in LLMs.

What is a circuit in LLMs?

Before we can define a “circuit” in language models, we first need to look inside the LLM. It’s a Neural Network built on the transformer architecture, so it seems obvious to treat neurons as a basic computational unit and interpret the patterns of their activations across layers as the model’s computation circuit.

However, the “Towards Monosemanticity” paper revealed that tracking neuron activations alone doesn’t provide a clear understanding of why those neurons are activated. This is because individual neurons are often polysemantic: they respond to a mix of unrelated concepts.

The paper further showed that neurons are composed of more fundamental units called features, which capture more interpretable information. In fact, a neuron can be seen as a combination of features. So rather than tracing neuron activations, we aim to trace feature activations, the actual units of meaning driving the model’s outputs.

With that, we can define a circuit as a sequence of feature activations and connections used by the model to transform a given input into an output.

Now that we know what we’re looking for, let’s dive into the technical setup.

Technical Setup

We’ve established that we need to trace feature activations rather than neuron activations. To enable this, we need to convert the neurons of the existing LLM models into features, i.e. build a replacement model that represents computations in terms of features.

Before diving into how this replacement model is constructed, let’s briefly review the architecture of Transformer-based large language models.

The following diagram illustrates how transformer-based language models operate. The idea is to convert the input into tokens and then into embeddings. These tokens are passed to the attention block, which calculates the relationships between tokens. Then, each token is passed to the multi-layer perceptron (MLP) block, which further refines the token using a non-linear activation and linear transformations. This process is repeated across many layers before the model generates the final output.

Image by Author

Now that we have laid out the structure of transformer based LLM, let’s looks at what transcoders are. The authors have used a “Transcoder” to develop the replacement model.

Transcoders

A transcoder is a neural network (generally with a much higher dimension than the LLM’s hidden dimension) designed to replace the MLP block in a transformer model with a more interpretable, functionally equivalent component (features).

Image by Author

It processes tokens from the attention block in three stages: encoding, sparse activation, and decoding. Effectively, it scales the input to a higher-dimensional space, applies activation to force the model to activate only sparse features, and then compresses the output back to the original dimension in the decoding stage.

Image by Author
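
As a rough mental model, a transcoder can be sketched in PyTorch as below. The dimensions, the ReLU, and the top-k sparsity rule are illustrative assumptions, not the paper’s exact setup:

import torch
import torch.nn as nn

class TranscoderSketch(nn.Module):
    """Widen, keep a few active features, project back. The dimensions and the
    top-k sparsity rule are illustrative assumptions, not the paper's exact setup."""
    def __init__(self, d_model: int = 768, d_features: int = 16_384, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # encode into the feature space
        self.decoder = nn.Linear(d_features, d_model)   # decode back to the model's dimension
        self.k = k

    def forward(self, residual_stream: torch.Tensor) -> torch.Tensor:
        acts = torch.relu(self.encoder(residual_stream))            # feature activations
        topk = torch.topk(acts, self.k, dim=-1)                     # keep only the strongest features
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse)                                 # stands in for the MLP output

In training, the decoder output would be pushed to match the original MLP block’s output for the same input, so the sparse feature activations in the middle become an interpretable stand-in for what the MLP computes.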

With a basic understanding of transformer-based LLMs and transcoder, let’s look at how a transcoder is used to build a replacement model.

Construct a replacement model

As mentioned earlier, a transformer block typically consists of two main components: an attention block and an MLP block (feedforward network). To build a replacement model, the MLP block in the original transformer model is replaced with a transcoder. This integration is seamless because the transcoder is trained to mimic the output of the original MLP, while also exposing its internal computations through sparse and modular features.

While standard transcoders are trained to imitate the MLP behavior within a single transformer layer, the authors of the paper used a cross layer transcoder (CLT), which captures the combined effects of multiple transcoder blocks across several layers. This is important because it allows us to track if a feature is spread across multiple layers, which is needed for circuit tracing.

The below image illustrates how the cross-layer transcoders (CLT) setup is used in building a replacement model. The Transcoder output at layer 1 contributes to constructing the MLP-equivalent output in all the upper layers until the end.

Image by Author

Side note: the following image is from the paper and shows how a replacement model is constructed. It replaces the neurons of the original model with features.

Image from https://transformer-circuits.pub/2025/attribution-graphs/methods.html#graphs-constructing

Now that we understand the architecture of the replacement model, let’s look at how an interpretable representation is built on top of the replacement model’s computational path.

Interpretable representation of the model’s computation: Attribution graph

To build the interpretable representation of the model’s computational path, we start from the model’s output feature and trace backward through the feature network to uncover which earlier feature contributed to it. This is done using the backward Jacobian, which tells how much a feature in the previous layer contributed to the current feature activation, and is applied recursively until we reach the input. Each feature is considered as a node and each influence as an edge. This process can lead to a complex graph with millions of edges and nodes, hence pruning is done to keep the graph compact and manually interpretable.
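
To make the backward-Jacobian step concrete, here is a tiny illustrative sketch; it is not the paper’s pipeline, which operates on the replacement model’s feature activations and applies heavy pruning:

import torch

def local_edge_weights(downstream_fn, upstream_acts: torch.Tensor) -> torch.Tensor:
    # downstream_fn maps earlier-layer feature activations to later-layer ones
    J = torch.autograd.functional.jacobian(downstream_fn, upstream_acts)
    # J[t, s] is d(target feature t) / d(source feature s) at this particular input;
    # weighting by the source activation gives the edge weight from source to target.
    return J * upstream_acts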

The authors refer to this computational graph as an attribution graph and have also developed a tool to inspect it. This forms the core contribution of the paper.

The image below illustrate a sample attribution graph.

Image from https://transformer-circuits.pub/2025/attribution-graphs/methods.html#graphs

Now, with all this understanding, we can go to feature interpretability.

Feature interpretability using an attribution graph

The researchers used attribution graphs on Anthropic’s Claude 3.5 Haiku model to study how it behaves across different tasks. In the case of poem generation, they discovered that the model doesn’t just generate the next word. It engages in a form of planning, both forward and backward. Before generating a line, the model identifies several possible rhyming or semantically appropriate words to end with, then works backward to craft a line that naturally leads to that target. Surprisingly, the model appears to hold multiple candidate end words in mind simultaneously, and it can restructure the entire sentence based on which one it ultimately chooses.

This technique offers a clear, mechanistic view of how language models generate structured, creative text. This is a significant milestone for the AI community. As we develop increasingly powerful models, the ability to trace and understand their internal planning and execution will be essential for ensuring alignment, safety, and trust in AI systems.

Limitations of the current approach

Attribution graphs offer a way to trace model behavior for a single input, but they don’t yet provide a reliable method for understanding global circuits or the consistent mechanisms a model uses across many examples. This analysis relies on replacing MLP computations with transcoders, but it is still unclear whether these transcoders truly replicate the original mechanisms or simply approximate the outputs. Additionally, the current approach highlights only active features, but inactive or inhibitory ones can be just as important for understanding the model’s behavior.

Conclusion

Circuit tracing via attribution graph is an early but important step toward understanding how language models work internally. While this approach still has a long way to go, the introduction of circuit tracing marks a major milestone on the path to true interpretability.

The post Circuit Tracing: A Step Closer to Understanding Large Language Models appeared first on Towards Data Science.

]]>
Creating an AI Agent to Write Blog Posts with CrewAI https://towardsdatascience.com/creating-an-ai-agent-to-write-blog-posts-with-crewai/ Fri, 04 Apr 2025 18:11:49 +0000 https://towardsdatascience.com/?p=605422 How to assemble a crew of AI agents with CrewAI and Python

The post Creating an AI Agent to Write Blog Posts with CrewAI appeared first on Towards Data Science.

]]>
Introduction

I love writing. You may notice that if you follow me or my blog. For that reason, I am constantly producing new content and talking about Data Science and Artificial Intelligence.

I discovered this passion a couple of years ago when I was just starting my path in Data Science, learning and evolving my skills. At that time, I heard some more experienced professionals in the area saying that a good study technique was practicing new skills and writing about it somewhere, teaching whatever you learned.

In addition, I had just moved to the US, and nobody knew me here. So I had to start somewhere, creating my professional image in this competitive market. I remember I talked to my cousin, who’s also in the Tech industry, and he told me: write blog posts about your experiences. Tell people what you are doing. And so I did. 

And I never stopped.

Fast forward to 2025, now I have almost two hundred published articles, many of them with Towards Data Science, a published Book, and a good audience. 

Writing helped me so much in the Data Science area.

Most recently, one of my interests has been the amazing Natural Language Processing and Large Language Models subjects. Learning about how these modern models work is fascinating. 

That interest led me to experiment with Agentic AI as well. That is how I found CrewAI, an open-source package that makes it fun and easy to build AI agents with very little code. I decided to test it by creating a crew of agents to write a blog post, and then see how that goes.

In this post, we will learn how to create those agents and make them work together to produce a simple blog post.

Let’s do that.

What is a Crew?

A crew of AI agents is a combination of two or more agents, each of them performing a task towards a final goal.

In this case study, we will create a crew that will work together to produce a small blog post about a given topic that we will provide.

Crew of Agents workflow. Image by the author

The flow works like this:

  1. We choose a given topic for the agents to write about.
  2. Once the crew is started, it will go to the knowledge base, read some of my previously written articles, and try to mimic my writing style. Then, it generates a set of guidelines and passes it to the next agent.
  3. Next, the Planner agent takes over and searches the Internet looking for good content about the topic. It creates a plan of content and sends it to the next agent.
  4. The Writer agent receives the writing plan and executes it according to the context and information received.
  5. Finally, the content is passed to the last agent, the Editor, who reviews the content and returns the final document as the output.

In the following section, we will see how this can be created.

Code

CrewAI is a great Python package because it simplifies the code for us. So, let’s begin by installing the two needed packages.

pip install crewai crewai-tools

Next, if you want, you can follow the instructions on their Quickstart page and have a full project structure created for you with just a couple of commands on a terminal. Basically, it will install some dependencies, generate the folder structure suggested for CrewAI projects, as well as generate some .yaml and .py files. 

I personally prefer to create those myself, but it is up to you. The page is listed in the References section.
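If you go the scaffolding route, it is a single command; the syntax below is from my memory of the Quickstart, so double-check it against the current docs, since the CLI evolves:

crewai create crew blog_writer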

Folder Structure

So, here we go.

We will create these folders:

  • knowledge
  • config

And these files:

  • In the config folder: create the files agents.yaml and tasks.yaml
  • In the knowledge folder: this is where I will add the files that capture my writing style.
  • In the project root: create crew.py and main.py.
Folders structure. Image by the author.

Make sure to create the folders with the names mentioned, as CrewAI looks for agents and tasks inside the config folder and for the knowledge base within a knowledge folder.
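To make this concrete, here is roughly what the project tree looks like once everything is in place (the root folder name and the PDF file names are just examples):

blog_writer/
├── config/
│   ├── agents.yaml
│   └── tasks.yaml
├── knowledge/
│   ├── article1.pdf
│   ├── article2.pdf
│   └── article3.pdf
├── crew.py
└── main.py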

Next, let us set our agents. 

Agents

The agents are composed of:

  • Name of the agent: writer_style
  • Role: LLMs are good role players, so here you can tell them which role to play.
  • Goal: tell the model what the goal of that agent is.
  • Backstory: Describe the story behind this agent, who it is, what it does. 
writer_style:
  role: >
    Writing Style Analyst
  goal: >
    Thoroughly read the knowledge base and learn the characteristics of the crew, 
    such as tone, style, vocabulary, mood, and grammar.
  backstory: >
    You are an experienced ghost writer who can mimic any writing style.
    You know how to identify the tone and style of the original writer and mimic 
    their writing style.
    Your work is the basis for the Content Writer to write an article on this topic.

I won’t bore you with all the agents created for this crew. I believe you get the idea: each one is a set of prompts explaining to the agent what it is going to do. All the agents’ instructions are stored in the agents.yaml file.

Think of it as if you were a manager hiring people to create a team. Think about what kinds of professionals you would need, and what skills are needed.

We need 4 professionals who will work towards the final goal of producing written content: (1) a Writer Stylist, (2) a Planner, (3) a Writer, and (4) an Editor.

If you want to see the setup for them, just check the full code in the GitHub repository.
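Just to illustrate the pattern with one more example, a planner entry could look something like the sketch below. Keep in mind this is my paraphrase of the idea, not the exact text from the repository:

planner:
  role: >
    Content Planner
  goal: >
    Research {topic} on the internet and produce a structured content plan
    for the Content Writer to follow.
  backstory: >
    You are an experienced content strategist. You gather accurate, up-to-date
    information about {topic} and organize it into an outline with key points,
    sections, and sources.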

Tasks

Now, back to the analogy of the manager hiring people: once we “hired” our entire crew, it is time to separate the tasks. We know that we want to produce a blog post and we have 4 agents, but what will each of them do?

Well, that will be configured in the file tasks.yaml.

To illustrate, let me show you the code for the Writer agent. Once again, these are the parts needed for the prompt:

  • Name of the task: write
  • Description: The description is like telling the professional how you want that task to be performed, just like we would tell a new hire how to perform their new job. Give precise instructions to get the best result possible.
  • Expected output: This is how we want to see the output. Notice that I give instructions like the size of the blog post, the quantity of paragraphs, and other information that helps my agent to give me the expected output. 
  • Agent to perform it: Here, we are indicating the agent who will perform this task, using the same name set in the agents.yaml file.
  • Output file: Not always applicable, but when it is, this is the argument to use. We asked for a markdown file as output.
write:
  description: >
    1. Use the content plan to craft a compelling blog post on {topic}.
    2. Incorporate SEO keywords naturally.
    3. Sections/Subtitles are properly named in an engaging manner. Make sure 
    to add Introduction, Problem Statement, Code, Before You Go, References.
    4. Add a summarizing conclusion - This is the "Before You Go" section.
    5. Proofread for grammatical errors and alignment with the writer's style.
    6. Use analogies to make the article more engaging and complex concepts easier
    to understand.
  expected_output: >
    A well-written blog post in markdown format, ready for publication.
    The article must be within a 7 to 12 minutes read.
    Each section must have at least 3 paragraphs.
    When writing code, you will write a snippet of code and explain what it does. 
    Be careful to not add a huge snippet at a time. Break it in reasonable chunks.
    In the examples, create a sample dataset for the code.
    In the Before You Go section, you will write a conclusion that is engaging
    and factually accurate.
  agent: content_writer
  output_file: blog_post.md

After the agents and tasks are defined, it is time to create our crew flow.

Coding the Crew

Now we will create the file crew.py, where we will translate the previously presented flow to Python code.

We begin by importing the needed modules.

#Imports
import os
from crewai import Agent, Task, Process, Crew, LLM
from crewai.project import CrewBase, agent, crew, task
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai_tools import SerperDevTool

We will use the basic Agent, Task, Crew, Process and LLM classes to create our flow. PDFKnowledgeSource will help the first agent learn my writing style, and SerperDevTool is the tool to search the internet. For that one, make sure to get your API key at https://serper.dev/signup.

A best practice in software development is to keep your API keys and configuration settings separate from your code. We’ll use a .env file for this, providing a secure place to store these values. Here’s the command to load them into our environment.

from dotenv import load_dotenv
load_dotenv()
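For reference, the .env file could look like the sketch below. SERPER_API_KEY is the variable name SerperDevTool expects by default; the Groq key is only needed if you go with the Qwen option:

OPENAI_API_KEY=sk-...
SERPER_API_KEY=your-serper-key
GROQ_API_KEY=your-groq-key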

Then, we will use the PDFKnowledgeSource to show the Crew where to search for the writer’s style. By default, that tool looks at the knowledge folder of your project, thus the importance of the name being the same.

# Knowledge sources

pdfs = PDFKnowledgeSource(
    file_paths=['article1.pdf',
                'article2.pdf',
                'article3.pdf'
                ]
)

Now we can set up the LLM we want to use for the Crew. It can be almost any model. I tested a bunch of them, and the ones I liked the most were qwen-qwq-32b and gpt-4o. If you choose OpenAI’s model, you will need an OpenAI API key. For Qwen-QWQ, just uncomment its lines, comment out the OpenAI lines, and use an API key from Groq instead.

# LLMs

llm = LLM(
    # model="groq/qwen-qwq-32b",
    # api_key= os.environ.get("GROQ_API_KEY"),
    model= "gpt-4o",
    api_key= os.environ.get("OPENAI_API_KEY"),
    temperature=0.4
)

Now we have to create a Crew Base, showing where CrewAI can find the agents and tasks configuration files.

# Creating the crew: base shows where the agents and tasks are defined

@CrewBase
class BlogWriter():
    """Crew to write a blog post"""
    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

Agents Functions

And we are ready to create the code for each agent. Each one is composed of a decorator @agent to show that the following function is an agent. We then use the class Agent, indicate the name of the agent in the config file, and set the level of verbosity (1 is low, 2 is high; a Boolean value such as True or False also works).

Lastly, we specify if the agent uses any tool, and what model it will use.

# Configuring the agents
    @agent
    def writer_style(self) -> Agent:
        return Agent(
                config=self.agents_config['writer_style'],
                verbose=1,
                knowledge_sources=[pdfs]
                )

    @agent
    def planner(self) -> Agent:
        return Agent(
        config=self.agents_config['planner'],
        verbose=True,
        tools=[SerperDevTool()],
        llm=llm
        )

    @agent
    def content_writer(self) -> Agent:
        return Agent(
        config=self.agents_config['content_writer'],
        verbose=1
        )

    @agent
    def editor(self) -> Agent:
        return Agent(
        config=self.agents_config['editor'],
        verbose=1
        )

Tasks Functions

The next step is creating the tasks. Similarly to the agents, we will create a function and decorate it with @task. We use the class Task to inherit CrewAI's functionalities and then point each function to its corresponding task in our tasks.yaml file. If any output file is expected, use the output_file argument.

# Configuring the tasks    

    @task
    def style(self) -> Task:
        return Task(
        config=self.tasks_config['mystyle'],
        )

    @task
    def plan(self) -> Task:
        return Task(
        config=self.tasks_config['plan'],
        )

    @task
    def write(self) -> Task:
        return Task(
        config=self.tasks_config['write'],
        output_file='output/blog_post.md' # This is the file that will contain the final blog post.
        )

    @task
    def edit(self) -> Task:
        return Task(
        config=self.tasks_config['edit']
        )

Crew

To glue everything together, we now create a function and decorate it with the @crew decorator. That function will line up the agents and the tasks in the order to be performed, since the process chosen here is the simplest: sequential. In other words, everything runs in sequence, from start to finish.

@crew

    def crew(self) -> Crew:
        """Creates the Blog Post crew"""

        return Crew(
            agents= [self.writer_style(), self.planner(), self.content_writer(), self.editor()],
            tasks= [self.style(), self.plan(), self.write(), self.edit()],
            process=Process.sequential,
            verbose=True
        )

Running the Crew

Running the crew is very simple. We create the main.py file and import the Crew Base BlogWriter we created. Then we just call crew().kickoff(inputs) to run it, passing a dictionary with the inputs to be used to generate the blog post.

# Script to run the blog writer project

# Warning control
import warnings
warnings.filterwarnings('ignore')
from crew import BlogWriter


def write_blog_post(topic: str):
    # Instantiate the crew
    my_writer = BlogWriter()
    # Run
    result = (my_writer
              .crew()
              .kickoff(inputs = {
                  'topic': topic
                  })
    )

    return result

if __name__ == "__main__":

    write_blog_post("Price Optimization with Python")

There it is. The result is a nice blog post created by the LLM. See below.

Resulting blog post. GIF by the author.

That is so nice!

Before You Go

Before you go, know that this blog post was 100% created by me. This crew I created was an experiment I wanted to do to learn more about how to create AI agents and make them work together. And, like I said, I love writing, so this is something I would be able to read and assess the quality.

My opinion is that this crew still didn’t do a very good job. They were able to complete the tasks successfully, but they gave me a very shallow post and code. I would not publish this, but at least it could be a start, maybe. 

From here, I encourage you to learn more about CrewAI. I took their free course where João de Moura, the creator of the package, shows us how to create different kinds of crews. It is really interesting.

GitHub Repository

https://github.com/gurezende/Crew_Writer

About Me

If you want to learn more about my work, or follow my blog (it is really me!), here are my contacts and portfolio.

https://gustavorsantos.me

References

[Quickstart CrewAI](https://docs.crewai.com/quickstart)

[CrewAI Documentation](https://docs.crewai.com/introduction)

[GROQ](https://groq.com/)

[OpenAI](https://openai.com)

[CrewAI Free Course](https://learn.crewai.com/)

The post Creating an AI Agent to Write Blog Posts with CrewAI appeared first on Towards Data Science.

]]>
Kernel Case Study: Flash Attention https://towardsdatascience.com/kernel-case-study-flash-attention/ Thu, 03 Apr 2025 18:53:44 +0000 https://towardsdatascience.com/?p=605403 Understanding all versions of flash attention through a triton implementation

The post Kernel Case Study: Flash Attention appeared first on Towards Data Science.

]]>
The attention mechanism is at the core of modern day transformers. But scaling the context window of these transformers was a major challenge, and it still is, even though we are in the era of million-token-plus context windows (Qwen 2.5 [1]). There are considerable compute- and memory-bound complexities in these models when we scale the context window (a naive attention mechanism scales quadratically in both compute and memory requirements). Revisiting Flash Attention lets us understand the complexities of optimizing the underlying operations on GPUs and, more importantly, gives us a better grip on thinking about what’s next.

Let’s quickly revisit a naive attention algorithm to see what’s going on.

Attention Algorithm. Image by Author

As you can see, if we are not careful we will end up materializing a full NxM attention matrix in GPU HBM, meaning the memory requirement grows quadratically with increasing context length.
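To make that concrete, here is a minimal sketch of the naive (non-causal, single-head) attention computation in PyTorch. Note how the NxM score matrix and the attention weights are fully materialized:

import torch
import torch.nn.functional as F

def naive_attention(Q, K, V):
    # Q: (N, D), K: (M, D), V: (M, D)
    scale = Q.shape[-1] ** -0.5
    S = (Q @ K.T) * scale      # (N, M) score matrix, fully materialized
    P = F.softmax(S, dim=-1)   # (N, M) attention weights, also materialized
    return P @ V               # (N, D) output

Q, K, V = (torch.randn(1024, 64) for _ in range(3))
O = naive_attention(Q, K, V)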

If you want to learn more about the GPU memory hierarchy and its trade-offs, my previous post on Triton is a good starting point. It will also come in handy later in this post when we implement the Flash Attention kernel in Triton. The Flash Attention paper also has a really good introduction to this.

Additionally, when we look at the steps involved in executing this algorithm and its pattern of accessing the slow HBM, (which as explained later in the post could be a major bottleneck as well) we notice a few things:

  1. We have Q, K and V in the HBM initially
  2. We need to access Q and K initially from the HBM to compute the dot product
  3. We write the output scores back to the HBM
  4. We access it again to execute the softmax, and optionally for Causal attention, like in the case of LLMs, we will have to mask this output before the softmax. The resulting full attention matrix is written again into the HBM
  5. We access the HBM again to execute the final dot product, to get both the attention weights and the Value matrix to write the output back to the slow GPU memory

I think you get the point. We could smartly read and write from the HBM to avoid redundant operations, to make some potential gains. This is exactly the primary motivation for the original Flash Attention algorithm.

Flash Attention initially came out in 2022 [2], followed a year later by some much needed improvements as Flash Attention v2 [3], and again in 2024 with additional improvements targeting Nvidia Hopper and Blackwell GPUs [4] as Flash Attention v3 [5]. The original Flash Attention paper identified that the attention operation is limited by memory bandwidth rather than compute. (In the past, there have been attempts to reduce the computational complexity of attention from O(N**2) to O(N log N) and lower through approximate algorithms.)

Flash attention proposed a fused kernel which does all of the above attention operations in one go, block-wise, to get the final attention output without ever having to realize the full N**2 attention matrix in memory, making the algorithm significantly faster. The term `fused` simply means we combine multiple operations in GPU SRAM before making the much slower round trip to HBM, which makes the algorithm performant. All the while providing the exact attention output without any approximations.

This lecture, from Stanford CS139, demonstrates brilliantly how we can think of the impact of a well thought out memory access pattern can have on an algorithm. I highly recommend you check this one out if you haven’t already.

Before we start diving into flash attention (it’s getting tedious to type this over and over so let’s agree to call it FA, shall we?) in triton there is something else that I wanted to get out of the way.

Numerical Stability in exponents

Let’s take the example of FP32 numbers. float32 (standard 32-bit float) uses 1 sign bit, 8 exponent bits, and 23 mantissa bits [6]. The largest finite power of two representable in float32 is 2^127 ≈ 1.7×10^38. Which implies that when we look at exponentials, e^88 ≈ 1.65×10^38: anything close to 88 (although in practice we stay well below that to keep it safe) and we are in trouble, as we could easily overflow. Here’s a very interesting chat with OpenAI o1 shared by folks at AllenAI in their OpenInstruct repo. Although it is about stabilizing KL-divergence calculations in the setting of RLHF/RL, the ideas translate exactly to exponentials as well. So to deal with the softmax situation in attention, what we do is the following:

Softmax with rescaling. Image by Author

TRICK : Let’s also observe the following, if you do this:

Rescaling Trick. Image by Author

then you can rescale/readjust values without affecting the final softmax value. This is really useful when you have an initial estimate for the maximum value, but that might change when we encounter a new set of values. I know I know, stay with me and let me explain.
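Here is a quick numerical sanity check of that identity in plain NumPy: rescaling the old exponentials by exp(m_old - m_new) gives exactly what we would have computed with the new maximum, so the softmax is unaffected:

import numpy as np

x = np.array([1.0, 3.0, 5.0])
m_old, m_new = 4.0, 6.0                     # a stale max estimate and the updated one

p_old = np.exp(x - m_old)                   # computed with the stale max
p_rescaled = p_old * np.exp(m_old - m_new)  # rescaled to the new max
p_direct = np.exp(x - m_new)                # what we would have computed directly

print(np.allclose(p_rescaled, p_direct))    # True
print(p_direct / p_direct.sum())            # the softmax value is unchanged either way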

Setting the scene

Let’s take a small detour into matrix multiplication.

Blocked Matrix Multiplication. Image by Author

This shows a toy example of a blocked matrix multiplication except we have blocks only on the rows of A (green) and columns of B (Orange? Beige?). As you can see above the output O1, O2, O3 and O4 are complete (those positions need no more calculations). We just need to fill in the remaining columns in the initial rows by using the remaining columns of B. Like below:

Next set of block fill the remaining spaces up. Image by Author

So we can fill these places in the output with a block of columns from B and a block of rows from A at a time.
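In code, this blocked multiplication is nothing more than the following minimal NumPy sketch (the sizes and block sizes are arbitrary toy values):

import numpy as np

N, M, D = 8, 8, 4                  # A is N x D, B is D x M
Br, Bc = 2, 2                      # block sizes for rows of A and columns of B
A, B = np.random.randn(N, D), np.random.randn(D, M)

O = np.zeros((N, M))
for i in range(0, N, Br):          # one block of rows of A at a time
    for j in range(0, M, Bc):      # one block of columns of B at a time
        O[i:i+Br, j:j+Bc] = A[i:i+Br] @ B[:, j:j+Bc]

assert np.allclose(O, A @ B)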

Connecting the dots

When I introduced FA, I said that we never have to compute the full attention matrix and store the whole thing. So here’s what we do:

  1. Compute a block of the attention matrix using a block of rows from Q and a block of columns from K. Once you get the partial attention matrix compute a few statistics and keep it in the memory.
Computing block attention scores S_b, and computing the row-wise maximums. Image by Author

I have greyed O5 to O12 because we don’t know those values yet, as they need to come from the subsequent blocks. We then transform Sb like below:

Keeping a track of the current row-sum and row-maxes. Image by Author
Exponents with the scaling trick. Image by Author

Now you have the setup for a partial softmax

Partial Softmax, as the denominator is still a partial sum. Image by Author

But:

  1. What if the true maximum is in the Oi’s that are yet to come?
  2. The sum is still local, so we need to update this every time we see new Pi’s. We know how to keep track of a sum, but what about rebasing it to the true maximum?

Recall the trick above. All that we have to do is to keep a track of the maximum values we encounter for each row, and iteratively update as you see new maximums from the remaining blocks of columns from K for the same set of rows from Q.

Two consecutive blocks and its row max manipulations. Image by Author
Updating the estimate of our current sum with rescaling

We still do not want to write our partial softmax matrix into HBM. We keep it for the next step.
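Putting the running max and running sum together, here is a tiny NumPy sketch of this online softmax accumulation for a single row block of scores, processed one column block at a time:

import numpy as np

S = np.random.randn(2, 8)           # one row block of scores, 8 columns in total
Bc = 2                              # column block size

m = np.full((2, 1), -np.inf)        # running row-max
l = np.zeros((2, 1))                # running row-sum of exponentials

for j in range(0, S.shape[1], Bc):
    S_b = S[:, j:j+Bc]
    m_b = S_b.max(axis=1, keepdims=True)        # local row-max of this block
    p_b = np.exp(S_b - m_b)
    l_b = p_b.sum(axis=1, keepdims=True)        # local expsum

    m_new = np.maximum(m, m_b)
    l = np.exp(m - m_new) * l + np.exp(m_b - m_new) * l_b   # rescale both sums
    m = m_new

# after all blocks, l equals the true softmax denominator for each row
assert np.allclose(l, np.exp(S - S.max(axis=1, keepdims=True)).sum(axis=1, keepdims=True))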

The final dot product

The last step in our attention computation is the dot product with V. To start, we initialize a matrix full of 0’s in HBM as our output of shape NxD, where N is the number of queries as above. We use the same block size for V as we had for K, except we apply it row-wise like below (the subscripts just denote that this is only a block and not the full matrix).

A single block of attention scores creating a partial output. Image by Author
Whereas the full output would require the sum of all these dot products. Some of which will be filled in by the blocks to come. Image by Author

Notice how we need the attention scores from all the blocks to get the final product. But if we calculate the local score and accumulate it, like we did to get the actual Ls, we can form the full output at the end of processing all the blocks of columns (Kb) for a given row block (Qb).

Putting it all together

Let’s put all these ideas together to form the final algorithm

Flash Attention V1 Algorithm. Source: Tri Dao et.al [2]

To understand the notation, _ij implies that it is the local values for a given block of columns and rows and _i implies it’s for the global output rows and Query blocks. The only part we haven’t explained so far is the final update to Oi. That’s where we use all the ideas from above to get the right scaling.

The whole code is available as a gist here.

Let’s see what these initializations look like in torch:

def flash_attn_v1(Q, K, V, Br, Bc):
  """Flash Attention V1"""
  B, N, D = Q.shape
  M = K.shape[1]
  Nr = int(np.ceil(N/Br))
  Nc = int(np.ceil(M/Bc))  # number of K/V column blocks; K has M rows
  
  Q = Q.to('cuda')
  K = K.to('cuda')
  V = V.to('cuda')
  
  batch_stride = Q.stride(0)
  
  O = torch.zeros_like(Q).to('cuda')
  lis = torch.zeros((B, Nr, int(Br)), dtype=torch.float32).to('cuda')
  mis = torch.ones((B, Nr, int(Br)), dtype=torch.float32).to('cuda')*-torch.inf
  
  grid = (B, )
  flash_attn_v1_kernel[grid](
      Q, K, V,
      N, M, D,
      Br, Bc,
      Nr, Nc,
      batch_stride,
      Q.stride(1),
      K.stride(1),
      V.stride(1),
      lis, mis,
      O,
      O.stride(1),
  )
  return O

If you are unsure about the launch grid, checkout my introduction to Triton

Take a closer look at how we initialized our Ls and Ms. We are keeping one for each row block of Output/Query, each of size Br. There are Nr such blocks in total.

In the example above I was simply using Br = 2 and Bc = 2. But in the above code the initialization is based on the device capacity. I have included the calculation for a T4 GPU. For any other GPU, we need to get the SRAM capacity and adjust these numbers accordingly. Now for the actual kernel implementation:

# Flash Attention V1
import triton
import triton.language as tl
import torch
import numpy as np
import pdb

@triton.jit
def flash_attn_v1_kernel(
    Q, K, V,
    N: tl.constexpr, M: tl.constexpr, D: tl.constexpr,
    Br: tl.constexpr,
    Bc: tl.constexpr,
    Nr: tl.constexpr,
    Nc: tl.constexpr,
    batch_stride: tl.constexpr,
    q_rstride: tl.constexpr,
    k_rstride: tl.constexpr, 
    v_rstride: tl.constexpr,
    lis, mis,
    O,
    o_rstride: tl.constexpr):
    
    """Flash Attention V1 kernel"""
    
    pid = tl.program_id(0)
    

    for j in range(Nc):
        k_offset = ((tl.arange(0, Bc) + j*Bc) * k_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * M * D
        # Using k_rstride and v_rstride as we are looking at the entire row at once, for each k v block 
        v_offset = ((tl.arange(0, Bc) + j*Bc) * v_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * M * D
        k_mask = k_offset < (pid + 1) * M*D
        v_mask = v_offset < (pid + 1) * M*D
        k_load = tl.load(K + k_offset, mask=k_mask, other=0)
        v_load = tl.load(V + v_offset, mask=v_mask, other=0)
        for i in range(Nr):
            q_offset = ((tl.arange(0, Br) + i*Br) * q_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * N * D
            q_mask = q_offset < (pid + 1) * N*D
            q_load = tl.load(Q + q_offset, mask=q_mask, other=0)
            # Compute attention
            s_ij = tl.dot(q_load, tl.trans(k_load))
            m_ij = tl.max(s_ij, axis=1, keep_dims=True)
            p_ij = tl.exp(s_ij - m_ij)
            l_ij = tl.sum(p_ij, axis=1, keep_dims=True)
            
            ml_offset = tl.arange(0, Br) + Br * i + pid * Nr * Br
            m = tl.load(mis + ml_offset)[:, None]
            l = tl.load(lis + ml_offset)[:, None]

            m_new = tl.where(m < m_ij, m_ij, m)

            l_new = tl.exp(m - m_new) * l + tl.exp(m_ij - m_new) * l_ij

            o_ij = tl.dot(p_ij, v_load)

            output_offset = ((tl.arange(0, Br) + i*Br) * o_rstride)[:, None] + (tl.arange(0, D))[None, :] + pid * N * D
            output_mask = output_offset < (pid + 1) * N*D
            o_current = tl.load(O + output_offset, mask=output_mask)

            o_new = (1/l_new) * (l * tl.exp(m - m_new) * o_current + tl.exp(m_ij - m_new) * o_ij)

            tl.store(O + output_offset, o_new, mask=output_mask)
            tl.store(mis + ml_offset, tl.reshape(m_new, (Br,)))
            tl.store(lis + ml_offset, tl.reshape(l_new, (Br,)))

Let’s understand what’s happening here:

  1. Create 1 kernel for each NxD matrix in the batch. In reality we would have one more dimension to parallelize across, the head dimension. But for understanding the implementation I think this would suffice.
  2. In each kernel we do the following:
    1. For each block of columns in K and V we load up the relevant part of the matrix (Bc x D) into the GPU SRAM (Current total SRAM usage = 2BcD). This stays in the SRAM till we are done with all the row blocks
    2. For each row block of Q, we load the block onto SRAM as well (Current total SRAM Usage = 2BcD + BrD)
    3. On chip we compute the dot product (sij), compute the local row-maxes (mij), the exp (pij), and the expsum (lij)
    4. We load up the running stats for the ith row block. Two vectors of size Br x 1, which denotes the current global row-maxes (mi) and the expsum (li). (Current SRAM usage: 2BcD + BrD + 2Br)
    5. We get the new estimates for the global mi and li.
    6. We load the part of the output for this block of Q and update it using the new running stats and the exponent trick, we then write this back into the HBM. (Current SRAM usage: 2BcD + 2BrD + 2Br)
    7. We write the updated running stats also into the HBM.
  3. For a matrix of any size, aka any context length, at a time we will never materialize the full attention matrix, only a part of it always.
  4. We managed to fuse together all the ops into a single kernel, reducing HBM access considerably.

Final SRAM usage nonetheless stands at 4BD + 2B, where B was initially calculated as M/4d, with M being the SRAM capacity. I am not sure if I am missing something here; please comment if you know why this is the case!

Block Sparse Attention and V2 and V3

I will keep this short, as these versions keep the core idea but find better and better ways to do the same thing.

For Block Sparse Attention,

  1. Consider that we have masks for each block, as in the case of causal attention. If for a given block the mask is all zeros, then we can simply skip the entire block without computing anything, saving FLOPs. This is where the major gains were seen. To put this into perspective, in the case of BERT pre-training the algorithm gets a 15% boost over the best performing training setup at the time, whereas for GPT-2 we get a 3x speedup over the Hugging Face training implementation and ~2x over a Megatron setup.
Performance gain for autoregressive models, where we have a sparse mask. Source: Tri Dao et.al [2]

2. You can get the same performance in GPT-2 in a fraction of the time, literally shaving days off the training run, which is awesome!

In V2:

  1. Notice how currently we can only do parallelization at the batch and head dimension. But if you simply just flip the order to look at all the column blocks for a given row block then we get the following advantages:
    1. Each row block becomes embarrassingly parallel. You can see this from the illustrations above: you need all the column blocks for a given row block to fully form the attention output. If you were to run all the column blocks in parallel instead, you would end up with a race condition where multiple blocks try to update the same rows of the output at the same time. But not if you do it the other way around. Although there are atomic add operators in Triton which could help, they may potentially set us back.
    2. We can avoid hitting the HBM to get the global Ms and Ls. We can initialize one on the chip for each kernel.
    3. Also we do not have to scale all the output update terms with the new estimate of L. We can just compute stuff without dividing by L and at the end of all the column blocks, simply divide the output with the latest estimate of L, saving some FLOPS again!
  2. Much of the improvement also comes in the form of the backward kernel. I am omitting all the backward kernels from this. But they are a fun exercise to try and implement, although they are significantly more complex.

Here are some benchmarks:

Performance benchmark of FA v2 against existing attention algorithms. Source: Tri Dao et.al [3]

The actual implementations of these kernels need to take into account various nuances that we encounter in the real world. I have tried to keep it simple. But do check them out here.

More recently in V3:

  1. Newer GPUs, especially the Hopper and Blackwell GPUs, have low precision modes (FP8 in Hopper and FP4 in Blackwell), which can double and quadruple the throughput for the same power and chip area, and more specialized GEMM (General Matrix Multiply) kernels, which the previous version of the algorithm fails to capitalize on. This is because there are many non-GEMM operations, like softmax, which reduce the utilization of these specialized GPU kernels.
  2. FA v1 and v2 are essentially synchronous. Recall that in the v2 description I mentioned we are limited when column blocks try to write to the same output pointers, or when we have to go step by step using the output from the previous steps. Well, these modern GPUs can make use of special instructions to break this synchrony.

We overlap the comparatively low-throughput non-GEMM operations involved in softmax, such as floating point multiply-add and exponential, with the asynchronous WGMMA instructions for GEMM. As part of this, we rework the FlashAttention-2 algorithm to circumvent certain sequential dependencies between softmax and the GEMMs. For example, in the 2-stage version of our algorithm, while softmax executes on one block of the scores matrix, WGMMA executes in the asynchronous proxy to compute the next block.

Flash Attention v3, Shah et.al
  3. They also adapted the algorithm to target these specialized low precision Tensor Cores on these new devices, significantly increasing the FLOPs.

Some more benchmarks:

FA v3 Performance gain over v2. Source: Shah et. al [5]

Conclusion

There is much to admire in this work. The barrier to entry for this level of technical skill has often seemed high owing to the low-level details, but hopefully tools like Triton can change the game and get more people into it! The future is bright.

References

[1] Qwen 2.5-7B-Instruct-1M Huggingface Model Page

[2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

[3] Tri Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

[4] NVIDIA Hopper Architecture Page

[5] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

[6] Single-precision floating-point format, Wikipedia

The post Kernel Case Study: Flash Attention appeared first on Towards Data Science.

]]>
Agentic GraphRAG for Commercial Contracts https://towardsdatascience.com/agentic-graphrag-for-commercial-contracts/ Thu, 03 Apr 2025 04:27:24 +0000 https://towardsdatascience.com/?p=605397 Structuring legal information as a knowledge graph to increase the answer accuracy using a LangGraph agent

The post Agentic GraphRAG for Commercial Contracts appeared first on Towards Data Science.

]]>
In every business, legal contracts are foundational documents that define the relationships, obligations, and responsibilities between parties. Whether it’s a partnership agreement, an NDA, or a supplier contract, these documents often contain critical information that drives decision-making, risk management, and compliance. However, navigating and extracting insights from these contracts can be a complex and time-consuming process.

In this post, we’ll explore how we can streamline the process of understanding and working with legal contracts by implementing an end-to-end solution using agentic GraphRAG. I see GraphRAG as an umbrella term for any method that retrieves or reasons over information stored in a knowledge graph, enabling more structured and context-aware responses.

By structuring legal contracts into a knowledge graph in Neo4j, we can create a powerful repository of information that’s easy to query and analyze. From there, we’ll build a LangGraph agent that allows users to ask specific questions about the contracts, making it possible to rapidly uncover new insights.

The code is available in this GitHub repository.

Why structuring data matters

Some domains work well with naive RAG, but legal contracts present unique challenges.

Pulling information from irrelevant contracts using naive vector RAG

As shown in the image, relying solely on a vector index to retrieve relevant chunks can introduce risks, such as pulling information from irrelevant contracts. This is because legal language is highly structured, and similar wording across different agreements can lead to incorrect or misleading retrieval. These limitations highlight the need for a more structured approach, such as GraphRAG, to ensure precise and context-aware retrieval.

To implement GraphRAG, we first need to construct a knowledge graph.

Legal knowledge graph containing both structured and unstructured information.

To build a knowledge graph for legal contracts, we need a way to extract structured information from documents and store it alongside the raw text. An LLM can help by reading through contracts and identifying key details such as parties, dates, contract types, and important clauses. Instead of treating the contract as just a block of text, we break it down into structured components that reflect its underlying legal meaning. For example, an LLM can recognize that “ACME Inc. agrees to pay $10,000 per month starting January 1, 2024” contains both a payment obligation and a start date, which we can then store in a structured format.

Once we have this structured data, we store it in a knowledge graph, where entities like companies, agreements, and clauses are represented as nodes along with their relationships. The unstructured text remains available, but now we can use the structured layer to refine our searches and make retrieval far more precise. Instead of just fetching the most relevant text chunks, we can filter contracts based on their attributes. This means we can answer questions that naive RAG would struggle with, such as how many contracts were signed last month or whether we have any active agreements with a specific company. These questions require aggregation and filtering, which isn’t possible with standard vector-based retrieval alone.

By combining structured and unstructured data, we also make retrieval more context-aware. If a user asks about a contract’s payment terms, we ensure that the search is constrained to the right agreement rather than relying on text similarity, which might pull in terms from unrelated contracts. This hybrid approach overcomes the limitations of naive RAG and allows for a much deeper and more reliable analysis of legal documents.

Graph construction

We’ll leverage an LLM to extract structured information from legal documents, using the CUAD (Contract Understanding Atticus Dataset), a widely used benchmark dataset for contract analysis licensed under CC BY 4.0. CUAD dataset contains over 500 contracts, making it an ideal dataset for evaluating our structured extraction pipeline.

The token count distribution for the contracts is visualized below.

Most contracts in this dataset are relatively short, with token counts below 10,000. However, there are some much longer contracts, with a few reaching up to 80,000 tokens. These long contracts are rare, while shorter ones make up the majority. The distribution shows a steep drop-off, meaning long contracts are the exception rather than the rule.

We’re using Gemini-2.0-Flash for extraction, which has a 1 million token input limit, so handling these contracts isn’t a problem. Even the longest contracts in our dataset (around 80,000 tokens) fit well within the model’s capacity. Since most contracts are much shorter, we don’t have to worry about truncation or breaking documents into smaller chunks for processing.

Structured data extraction

Most commercial LLMs have the option to use Pydantic objects to define the schema of the output. An example for location:

class Location(BaseModel):
    """
    Represents a physical location including address, city, state, and country.
    """

    address: Optional[str] = Field(
        ..., description="The street address of the location.Use None if not provided"
    )
    city: Optional[str] = Field(
        ..., description="The city of the location.Use None if not provided"
    )
    state: Optional[str] = Field(
        ..., description="The state or region of the location.Use None if not provided"
    )
    country: str = Field(
        ...,
        description="The country of the location. Use the two-letter ISO standard.",
    )

When using LLMs for structured output, Pydantic helps define a clear schema by specifying the types of attributes and providing descriptions that guide the model’s responses. Each field has a type, such as str or Optional[str], and a description that tells the LLM exactly how to format the output.

For example, in a Location model, we define key attributes like address, city, state, and country, specifying what data is expected and how it should be structured. The country field, for instance, follows the two-letter ISO country code standard, like "US", "FR", or "JP", instead of inconsistent variations like “United States” or “USA.” The same principle applies to other structured data as well: ISO 8601 keeps dates in a standard format (YYYY-MM-DD), and so on.

By defining structured output with Pydantic, we make LLM responses more reliable, machine-readable, and easier to integrate into databases or APIs. Clear field descriptions further help the model generate correctly formatted data, reducing the need for post-processing.

The Pydantic schema models can be more sophisticated like the Contract model below, which captures key details of a legal agreement, ensuring the extracted data follows a standardized structure.

class Contract(BaseModel):
    """
    Represents the key details of the contract.
    """
  
    summary: str = Field(
        ...,
        description=("High level summary of the contract with relevant facts and details. Include all relevant information to provide full picture."
        "Do no use any pronouns"),
    )
    contract_type: str = Field(
        ...,
        description="The type of contract being entered into.",
        enum=CONTRACT_TYPES,
    )
    parties: List[Organization] = Field(
        ...,
        description="List of parties involved in the contract, with details of each party's role.",
    )
    effective_date: str = Field(
        ...,
        description=(
            "Enter the date when the contract becomes effective in yyyy-MM-dd format."
            "If only the year (e.g., 2015) is known, use 2015-01-01 as the default date."
            "Always fill in full date"
        ),
    )
    contract_scope: str = Field(
        ...,
        description="Description of the scope of the contract, including rights, duties, and any limitations.",
    )
    duration: Optional[str] = Field(
        None,
        description=(
            "The duration of the agreement, including provisions for renewal or termination."
            "Use ISO 8601 durations standard"
        ),
    )
  
    end_date: Optional[str] = Field(
        None,
        description=(
            "The date when the contract expires. Use yyyy-MM-dd format."
            "If only the year (e.g., 2015) is known, use 2015-01-01 as the default date."
            "Always fill in full date"
        ),
    )
    total_amount: Optional[float] = Field(
        None, description="Total value of the contract."
    )
    governing_law: Optional[Location] = Field(
        None, description="The jurisdiction's laws governing the contract."
    )
    clauses: Optional[List[Clause]] = Field(
        None, description=f"""Relevant summaries of clause types. Allowed clause types are {CLAUSE_TYPES}"""
    )

This contract schema organizes key details of legal agreements in a structured way, making it easier to analyze with LLMs. It includes different types of clauses, such as confidentiality or termination, each with a short summary. The parties involved are listed with their names, locations, and roles, while contract details cover things like start and end dates, total value, and governing law. Some attributes, such as governing law, can be defined using nested models, enabling more detailed and complex outputs.

The nested object approach works well with some AI models that handle complex data relationships, while others may struggle with deeply nested details.

We can test our approach using the following example. We are using the LangChain framework to orchestrate LLMs.

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
llm.with_structured_output(Contract).invoke(
    "Tomaz works with Neo4j since 2017 and will make a billion dollar until 2030."
    "The contract was signed in Las Vegas"
)

which outputs

Contract(
    summary="Tomaz works with Neo4j since 2017 and will make a billion dollar until 2030.",
    contract_type="Service",
    parties=[
        Organization(
            name="Tomaz",
            location=Location(
                address=None,
                city="Las Vegas",
                state=None,
                country="US"
            ),
            role="employee"
        ),
        Organization(
            name="Neo4j",
            location=Location(
                address=None,
                city=None,
                state=None,
                country="US"
            ),
            role="employer"
        )
    ],
    effective_date="2017-01-01",
    contract_scope="Tomaz will work with Neo4j",
    duration=None,
    end_date="2030-01-01",
    total_amount=1_000_000_000.0,
    governing_law=None,
    clauses=None
)

Now that our contract data is in a structured format, we can define the Cypher query needed to import it into Neo4j, mapping entities, relationships, and key clauses into a graph structure. This step transforms raw extracted data into a queryable knowledge graph, enabling efficient traversal and retrieval of contract insights.

UNWIND $data AS row
MERGE (c:Contract {file_id: row.file_id})
SET c.summary = row.summary,
    c.contract_type = row.contract_type,
    c.effective_date = date(row.effective_date),
    c.contract_scope = row.contract_scope,
    c.duration = row.duration,
    c.end_date = CASE WHEN row.end_date IS NOT NULL THEN date(row.end_date) ELSE NULL END,
    c.total_amount = row.total_amount
WITH c, row
CALL (c, row) {
    WITH c, row
    WHERE row.governing_law IS NOT NULL
    MERGE (c)-[:HAS_GOVERNING_LAW]->(l:Location)
    SET l += row.governing_law
}
FOREACH (party IN row.parties |
    MERGE (p:Party {name: party.name})
    MERGE (p)-[:HAS_LOCATION]->(pl:Location)
    SET pl += party.location
    MERGE (p)-[pr:PARTY_TO]->(c)
    SET pr.role = party.role
)
FOREACH (clause IN row.clauses |
    MERGE (c)-[:HAS_CLAUSE]->(cl:Clause {type: clause.clause_type})
    SET cl.summary = clause.summary
)

This Cypher query imports structured contract data into Neo4j by creating Contract nodes with attributes such as summary, contract_type, effective_date, duration, and total_amount. If a governing law is specified, it links the contract to a Location node. Parties involved in the contract are stored as Party nodes, with each party connected to a Location and assigned a role in relation to the contract. The query also processes clauses, creating Clause nodes and linking them to the contract while storing their type and summary.

After processing and importing the contracts, the resulting graph follows the following graph schema.

Imported legal graph schema

Let’s also take a look at a single contract.

This graph represents a contract structure where a contract (orange node) connects to various clauses (red nodes), parties (blue nodes), and locations (violet nodes). The contract has three clauses: Renewal & Termination, Liability & Indemnification, and Confidentiality & Non-Disclosure. Two parties, Modus Media International and Dragon Systems, Inc., are involved, each linked to their respective locations, Netherlands (NL) and United States (US). The contract is governed by U.S. law. The contract node also contains additional metadata, including dates and other relevant details.

A public read-only instance containing CUAD legal contracts is available with the following credentials.

URI: neo4j+s://demo.neo4jlabs.com
username: legalcontracts
password: legalcontracts
database: legalcontracts
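If you want to poke around in that instance yourself, a minimal connection sketch with the official Neo4j Python driver (pip install neo4j, assuming a recent 5.x version) looks something like this:

from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "neo4j+s://demo.neo4jlabs.com",
    auth=("legalcontracts", "legalcontracts"),
)

# Count contracts by type as a quick smoke test
records, _, _ = driver.execute_query(
    "MATCH (c:Contract) RETURN c.contract_type AS type, count(*) AS count ORDER BY count DESC",
    database_="legalcontracts",
)
for record in records:
    print(record["type"], record["count"])

driver.close()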

Entity resolution

Entity resolution in legal contracts is challenging due to variations in how companies, individuals, and locations are referenced. A company might appear as “Acme Inc.” in one contract and “Acme Corporation” in another, requiring a process to determine whether they refer to the same entity.

One approach is to generate candidate matches using text embeddings or string distance metrics like Levenshtein distance. Embeddings capture semantic similarity, while string distance measures character-level differences. Once candidates are identified, additional evaluation is needed, comparing metadata such as addresses or tax IDs, analyzing shared relationships in the graph, or incorporating human review for critical cases.
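As a toy illustration of the candidate-generation step, here is a small sketch using only the Python standard library; a real pipeline would combine this with embeddings and metadata checks:

from difflib import SequenceMatcher
from itertools import combinations

def name_similarity(a: str, b: str) -> float:
    """Character-level similarity between two normalized party names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

parties = ["Acme Inc.", "Acme Incorporated", "Acme Corporation", "Dragon Systems, Inc."]

# Score every pair; pairs above a chosen threshold become merge candidates for review
for a, b in combinations(parties, 2):
    score = name_similarity(a, b)
    if score > 0.6:
        print(f"candidate match: {a!r} ~ {b!r} ({score:.2f})")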

For resolving entities at scale, both open-source solutions like Dedupe and commercial tools like Senzing offer automated methods. Choosing the right approach depends on data quality, accuracy requirements, and whether manual oversight is feasible.

With the legal graph constructed, we can move onto the agentic GraphRAG implementation. 

Agentic GraphRAG

Agentic architectures vary widely in complexity, modularity, and reasoning capabilities. At their core, these architectures involve an LLM acting as a central reasoning engine, often supplemented with tools, memory, and orchestration mechanisms. The key differentiator is how much autonomy the LLM has in making decisions and how interactions with external systems are structured.

One of the simplest and most effective designs, particularly for chatbot-like implementations, is a direct LLM-with-tools approach. In this setup, the LLM serves as the decision-maker, dynamically selecting which tools to invoke (if any), retrying operations when necessary, and executing multiple tools in sequence to fulfill complex requests. 

The diagram represents a simple LangGraph agent workflow. It begins at __start__, moving to the assistant node, where the LLM processes user input. From there, the assistant can either call tools to fetch relevant information or transition directly to __end__ to complete the interaction. If a tool is used, the assistant processes the response before deciding whether to call another tool or end the session. This structure allows the agent to autonomously determine when external information is needed before responding.

This approach is particularly well-suited to stronger commercial models like Gemini or GPT-4o, which excel at reasoning and self-correction.
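To make this concrete, here is a minimal sketch of such an agent using LangGraph’s prebuilt ReAct helper. The tool is the ContractSearchTool defined later in this post, and the model name is the same one used for extraction; treat this as an outline rather than the exact implementation:

from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
tools = [ContractSearchTool()]  # the contract retrieval tool described below

# create_react_agent wires up the assistant-node / tools-node loop shown in the diagram
agent = create_react_agent(llm, tools)

result = agent.invoke(
    {"messages": [("user", "Do we have any active contracts with Dragon Systems?")]}
)
print(result["messages"][-1].content)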

Tools

LLMs are powerful reasoning engines, but their effectiveness often depends on how well they are equipped with external tools. These tools, whether database queries, APIs, or search functions, extend an LLM’s ability to retrieve facts, perform calculations, or interact with structured data.

Designing tools that are both general enough to handle diverse queries and precise enough to return meaningful results is more art than science. What we’re really building is a semantic layer between the LLM and the underlying data. Rather than requiring the LLM to understand the exact structure of a Neo4j knowledge graph or a database schema, we define tools that abstract away these complexities.

With this approach, the LLM doesn’t need to know whether contract information is stored as graph nodes and relationships or as raw text in a document store. It only needs to invoke the right tool to fetch relevant data based on a user’s question.

In our case, the contract retrieval tool serves as this semantic interface. When a user asks about contract terms, obligations, or parties, the LLM calls a structured query tool that translates the request into a database query, retrieves relevant information, and presents it in a format the LLM can interpret and summarize. This enables a flexible, model-agnostic system where different LLMs can interact with contract data without needing direct knowledge of its storage or structure.

There’s no one-size-fits-all standard for designing an optimal toolset. What works well for one model may fail for another. Some models handle ambiguous tool instructions gracefully, while others struggle with complex parameters or require explicit prompting. The trade-off between generality and task-specific efficiency means tool design requires iteration, testing, and fine-tuning for the LLM in use.

For contract analysis, an effective tool should retrieve contracts and summarize key terms without requiring users to phrase queries rigidly. Achieving this flexibility depends on thoughtful prompt engineering, robust schema design, and adaptation to different LLM capabilities. As models evolve, so do strategies for making tools more intuitive and effective.

In this section, we’ll explore different approaches to tool implementation, comparing their flexibility, effectiveness, and compatibility with various LLMs.

My preferred approach is to dynamically and deterministically construct a Cypher query and execute it against the database. This method ensures consistent and predictable query generation while maintaining implementation flexibility. By structuring queries this way, we reinforce the semantic layer, allowing user inputs to be seamlessly translated into database retrievals. This keeps the LLM focused on retrieving relevant information rather than understanding the underlying data model.

Our tool is intended to identify relevant contracts, so we need to provide the LLM with options to search contracts based on various attributes. The input description is again provided as a Pydantic object.

class ContractInput(BaseModel):
    min_effective_date: Optional[str] = Field(
        None, description="Earliest contract effective date (YYYY-MM-DD)"
    )
    max_effective_date: Optional[str] = Field(
        None, description="Latest contract effective date (YYYY-MM-DD)"
    )
    min_end_date: Optional[str] = Field(
        None, description="Earliest contract end date (YYYY-MM-DD)"
    )
    max_end_date: Optional[str] = Field(
        None, description="Latest contract end date (YYYY-MM-DD)"
    )
    contract_type: Optional[str] = Field(
        None, description=f"Contract type; valid types: {CONTRACT_TYPES}"
    )
    parties: Optional[List[str]] = Field(
        None, description="List of parties involved in the contract"
    )
    summary_search: Optional[str] = Field(
        None, description="Inspect summary of the contract"
    )
    country: Optional[str] = Field(
        None, description="Country where the contract applies. Use the two-letter ISO standard."
    )
    active: Optional[bool] = Field(None, description="Whether the contract is active")
    monetary_value: Optional[MonetaryValue] = Field(
        None, description="The total amount or value of a contract"
    )

With LLM tools, attributes can take various forms depending on their purpose. Some fields are simple strings, such as contract_type and country, which store single values. Others, like parties, are lists of strings, allowing multiple entries (e.g., multiple entities involved in a contract).

Beyond basic data types, attributes can also represent complex objects. For example, monetary_value uses a MonetaryValue object, which includes structured data such as currency type and the operator. While attributes with nested objects offer a clear and structured representation of data, models tend to struggle to handle them effectively, so we should keep them simple.
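The MonetaryValue model itself isn’t shown at this point, but it could look something like the sketch below; the field names here are my assumption of what such a model would contain:

from pydantic import BaseModel, Field

class MonetaryValue(BaseModel):
    """A monetary constraint on a contract's total value."""

    value: float = Field(..., description="The amount to compare against.")
    currency: str = Field(..., description="Three-letter ISO 4217 currency code, e.g. USD.")
    operator: str = Field(
        ..., description="Comparison operator to apply, e.g. one of ==, >, >=, <, <=."
    )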

As part of this project, we’re experimenting with an additional cypher_aggregation attribute, providing the LLM with greater flexibility for scenarios that require specific filtering or aggregation.

cypher_aggregation: Optional[str] = Field(
    None,
    description="""Custom Cypher statement for advanced aggregations and analytics.

    This will be appended to the base query:
    ```
    MATCH (c:Contract)
    <filtering based on other parameters>
    WITH c, summary, contract_type, contract_scope, effective_date, end_date, parties, active, monetary_value, contract_id, countries
    <your cypher goes here>
    ```
    
    Examples:
    
    1. Count contracts by type:
    ```
    RETURN contract_type, count(*) AS count ORDER BY count DESC
    ```
    
    2. Calculate average contract duration by type:
    ```
    WITH contract_type, effective_date, end_date
    WHERE effective_date IS NOT NULL AND end_date IS NOT NULL
    WITH contract_type, duration.between(effective_date, end_date).days AS duration
    RETURN contract_type, avg(duration) AS avg_duration ORDER BY avg_duration DESC
    ```
    
    3. Calculate contracts per effective date year:
    ```
    RETURN effective_date.year AS year, count(*) AS count ORDER BY year
    ```
    
    4. Counts the party with the highest number of active contracts:
    ```
    UNWIND parties AS party
    WITH party.name AS party_name, active, count(*) AS contract_count
    WHERE active = true
    RETURN party_name, contract_count
    ORDER BY contract_count DESC
    LIMIT 1
    ```
    """

The cypher_aggregation attribute allows LLMs to define custom Cypher statements for advanced aggregations and analytics. It extends the base query by appending question-specified aggregation logic, enabling flexible filtering and computation.

This feature supports use cases such as counting contracts by type, calculating average contract duration, analyzing contract distributions over time, and identifying key parties based on contract activity. By leveraging this attribute, the LLM can dynamically generate insights tailored to specific analytical needs without requiring predefined query structures.

While this flexibility is valuable, it should be carefully evaluated, as increased adaptability comes at the cost of reduced consistency and robustness due to the added complexity of the operation.

We must clearly define the function’s name and description when presenting it to the LLM. A well-structured description helps guide the model in using the function correctly, ensuring it understands its purpose, expected inputs, and outputs. This reduces ambiguity and improves the LLM’s ability to generate meaningful and reliable queries.

class ContractSearchTool(BaseTool):
    name: str = "ContractSearch"
    description: str = (
        "useful for when you need to answer questions related to any contracts"
    )
    args_schema: Type[BaseModel] = ContractInput

Finally, we need to implement a function that processes the given inputs, constructs the corresponding Cypher statement, and executes it efficiently.

The core logic of the function centers on constructing the Cypher statement. We begin by matching the contract as the foundation of the query.

cypher_statement = "MATCH (c:Contract) "

Next, we need to implement the function that processes the input parameters. In this example, we primarily use attributes to filter contracts based on the given criteria.
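The snippets that follow append individual conditions to a filters list and their values to a params dictionary. As a minimal sketch of that bookkeeping (the exact setup in the full implementation may differ slightly):

filters = []  # individual Cypher WHERE conditions
params = {}   # query parameters passed alongside the statement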


Simple property filtering

For example, the contract_type attribute is used to perform simple node property filtering.

if contract_type:
    filters.append("c.contract_type = $contract_type")
    params["contract_type"] = contract_type

This code adds a Cypher filter for contract_type while using query parameters for the values to prevent query-injection vulnerabilities.

Since the possible contract type values are presented in the attribute description

contract_type: Optional[str] = Field(
    None, description=f"Contract type; valid types: {CONTRACT_TYPES}"
)

we don’t have to worry about mapping values from input to valid contract types as the LLM will handle that.

Inferred property filtering

We’re building tools for an LLM to interact with a knowledge graph, where the tools serve as an abstraction layer over structured queries. A key feature is the ability to use inferred properties at runtime, similar to an ontology but dynamically computed.

if active is not None:
    operator = ">=" if active else "<"
    filters.append(f"c.end_date {operator} date()")

Here, active acts as a runtime classification, determining whether a contract is ongoing (>= date()) or expired (< date()). This logic extends structured KG queries by computing properties only when needed, enabling more flexible LLM reasoning. By handling logic like this within tools, we ensure the LLM interacts with simplified, intuitive operations, keeping it focused on reasoning rather than query formulation.

Neighbor filtering

Sometimes filtering depends on neighboring nodes, such as restricting results to contracts involving specific parties. The parties attribute is an optional list, and when provided, it ensures only contracts linked to those entities are considered:

if parties:
    parties_filter = []
    for i, party in enumerate(parties):
        party_param_name = f"party_{i}"
        parties_filter.append(
            f"""EXISTS {{
            MATCH (c)<-[:PARTY_TO]-(party)
            WHERE toLower(party.name) CONTAINS ${party_param_name}
        }}"""
        )
        params[party_param_name] = party.lower()

This code filters contracts based on their associated parties, treating the logic as AND, meaning all specified conditions must be met for a contract to be included. It iterates through the provided parties list and constructs a query where each party condition must hold.

For each party, a unique parameter name is generated to avoid conflicts. The EXISTS clause ensures that the contract has a PARTY_TO relationship to a party whose name contains the specified value. The name is converted to lowercase to allow case-insensitive matching. Each party condition is added separately, enforcing an implicit AND between them.

If more complex logic were needed, such as supporting OR conditions or allowing different matching criteria, the input would need to change. Instead of a simple list of party names, a structured input format specifying operators would be required.

Additionally, we could implement a party-matching method that tolerates minor typos, improving the user experience by handling variations in spelling and formatting.
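As a rough illustration of what such a structured input could look like, here is a hypothetical Pydantic model with an explicit logical operator. It is not part of the current tool and only sketches the shape the input would take:

from enum import Enum
from typing import List
from pydantic import BaseModel

class LogicalOperator(str, Enum):
    AND = "AND"
    OR = "OR"

class PartiesFilter(BaseModel):
    """Hypothetical structured replacement for the plain list of party names."""
    names: List[str]
    operator: LogicalOperator = LogicalOperator.AND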

Custom operator filtering

To add more flexibility, we can introduce an operator object as a nested attribute, allowing more control over filtering logic. Instead of hardcoding comparisons, we define an enumeration for operators and use it dynamically.

For example, with monetary values, a contract might need to be filtered based on whether its total amount is greater than, less than, or exactly equal to a specified value. Instead of assuming a fixed comparison logic, we define an enum that represents the possible operators:

class NumberOperator(str, Enum):
    EQUALS = "="
    GREATER_THAN = ">"
    LESS_THAN = "<"

class MonetaryValue(BaseModel):
    """The total amount or value of a contract"""
    value: float
    operator: NumberOperator

if monetary_value:
    filters.append(f"c.total_amount {monetary_value.operator.value} $total_value")
    params["total_value"] = monetary_value.value

This approach makes the system more expressive. Instead of rigid filtering rules, the tool interface allows the LLM to specify not just a value but how it should be compared, making it easier to handle a broader range of queries while keeping the LLM’s interaction simple and declarative.

Some LLMs struggle with nested objects as inputs, making it harder to handle structured operator-based filtering. Adding a between operator introduces additional complexity since it requires two separate values, which can lead to ambiguity in parsing and input validation.

Min and Max attributes

To keep things simpler, I tend to gravitate toward using min and max attributes for dates, as this naturally supports range filtering and makes the between logic straightforward.

if min_effective_date:
    filters.append("c.effective_date >= date($min_effective_date)")
    params["min_effective_date"] = min_effective_date
if max_effective_date:
    filters.append("c.effective_date <= date($max_effective_date)")
    params["max_effective_date"] = max_effective_date

This function filters contracts based on an effective date range by adding an optional lower and upper bound condition when min_effective_date and max_effective_date are provided, ensuring that only contracts within the specified date range are included.
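Once all the structured filters have been collected, they can be joined into a single WHERE clause before the query moves on to ordering and aggregation. A minimal sketch, assuming the filters and params accumulators shown earlier (the original code may assemble the statement differently):

if filters:
    cypher_statement += "WHERE " + " AND ".join(filters) + " "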

Semantic search

An attribute can also be used for semantic search. Instead of relying on a vector index upfront, we apply vector search as a post-filtering step after metadata filtering: structured filters, like date ranges, monetary values, or parties, are applied first to narrow down the candidate set, and vector search is then performed over this filtered subset to rank results based on semantic similarity.

if summary_search:
    cypher_statement += (
        "WITH c, vector.similarity.cosine(c.embedding, $embedding) "
        "AS score ORDER BY score DESC WITH c, score WHERE score > 0.9 "
    )  # Define a threshold limit
    params["embedding"] = embeddings.embed_query(summary_search)
else:  # Else we sort by latest
    cypher_statement += "WITH c ORDER BY c.effective_date DESC "

This code applies semantic search when summary_search is provided by computing cosine similarity between the contract’s embedding and the query embedding, ordering results by relevance, and filtering out low-scoring matches with a threshold of 0.9. Otherwise, it defaults to sorting contracts by the most recent effective_date.

Dynamic queries

The cypher aggregation attribute is an experiment I wanted to test that gives the LLM a degree of partial text2cypher capability, allowing it to dynamically generate aggregations after the initial structured filtering. Instead of predefining every possible aggregation, this approach lets the LLM specify calculations like counts, averages, or grouped summaries on demand, making queries more flexible and expressive. However, since this shifts more query logic to the LLM, ensuring all generated queries work correctly becomes challenging, as malformed or incompatible Cypher statements can break execution. This trade-off between flexibility and reliability is a key consideration in designing the system.

if cypher_aggregation:
    cypher_statement += """WITH c, c.summary AS summary, c.contract_type AS contract_type, 
      c.contract_scope AS contract_scope, c.effective_date AS effective_date, c.end_date AS end_date,
      [(c)<-[r:PARTY_TO]-(party) | {party: party.name, role: r.role}] AS parties, c.end_date >= date() AS active, c.total_amount as monetary_value, c.file_id AS contract_id,
      apoc.coll.toSet([(c)<-[:PARTY_TO]-(party)-[:LOCATED_IN]->(country) | country.name]) AS countries """
    cypher_statement += cypher_aggregation

If no cypher aggregation is provided, we return the total count of identified contracts along with only five example contracts to avoid overwhelming the prompt. Handling excessive rows is crucial, as an LLM struggling with a massive result set isn't useful. Additionally, an LLM producing answers that list 100 contract titles isn't a good user experience either.

cypher_statement += """WITH collect(c) AS nodes
RETURN {
    total_count_of_contracts: size(nodes),
    example_values: [
      el in nodes[..5] |
      {summary:el.summary, contract_type:el.contract_type, 
       contract_scope: el.contract_scope, file_id: el.file_id, 
        effective_date: el.effective_date, end_date: el.end_date,
        monetary_value: el.total_amount, contract_id: el.file_id, 
        parties: [(el)<-[r:PARTY_TO]-(party) | {name: party.name, role: r.role}], 
        countries: apoc.coll.toSet([(el)<-[:PARTY_TO]-()-[:LOCATED_IN]->(country) | country.name])}
    ]
} AS output"""

This Cypher statement collects all matching contracts into a list, returning the total count and up to five example contracts with key attributes, including summary, type, scope, dates, monetary value, associated parties with roles, and unique country locations.

Now that our contract search tool is built, we hand it off to the LLM and, just like that, we have agentic GraphRAG implemented.
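For illustration, here is a minimal sketch of that hand-off using a LangGraph prebuilt ReAct agent and an OpenAI chat model. The example question is made up, and the exact wiring (and library versions) in the accompanying repository may differ:

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = create_react_agent(llm, tools=[ContractSearchTool()])

# The agent decides when to call the ContractSearch tool and how to fill its arguments.
response = agent.invoke(
    {"messages": [("user", "How many contracts are currently active?")]}
)
print(response["messages"][-1].content)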

Agent Benchmark

If you’re serious about implementing agentic GraphRAG, you need an evaluation dataset, not just as a benchmark but as a foundation for the entire project. A well-constructed dataset helps define the scope of what the system should handle, ensuring that initial development aligns with real-world use cases. Beyond that, it becomes an invaluable tool for evaluating performance, allowing you to measure how well the LLM interacts with the graph, retrieves information, and applies reasoning. It’s also essential for prompt engineering optimizations, letting you iteratively refine queries, tool use, and response formatting with clear feedback rather than guesswork. Without a structured dataset, you’re flying blind, making improvements harder to quantify and inconsistencies more difficult to catch.

The code for the benchmark is available on GitHub.

I have compiled a list of 22 questions which we will use to evaluate the system. Additionally, we are going to introduce a new metric called answer_satisfaction, for which we provide a custom prompt.

answer_satisfaction = AspectCritic(
    name="answer_satisfaction",
    definition="""You will evaluate an ANSWER to a legal QUESTION based on a provided SOLUTION.

Rate the answer on a scale from 0 to 1, where:
- 0 = incorrect, substantially incomplete, or misleading
- 1 = correct and sufficiently complete

Consider these evaluation criteria:
1. Factual correctness is paramount - the answer must not contradict the solution
2. The answer must address the core elements of the solution
3. Additional relevant information beyond the solution is acceptable and may enhance the answer
4. Technical legal terminology should be used appropriately if present in the solution
5. For quantitative legal analyses, accurate figures must be provided

+ fewshots
"""

Many questions can return a large amount of information. For example, asking for contracts signed before 2020 might yield hundreds of results. Since the LLM receives both the total count and a few example entries, our evaluation should focus on the total count, rather than which specific examples the LLM chooses to show.
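One lightweight way to reflect that in code is a helper that scores only the returned total against the expected count while ignoring the example rows. This is a hypothetical sketch, not part of the benchmark itself:

def count_matches(tool_output: dict, expected_count: int) -> bool:
    # Compare only the total number of matched contracts, not the five examples.
    return tool_output.get("total_count_of_contracts") == expected_count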

Benchmark results.

The provided results indicate that all evaluated models (Gemini 1.5 Pro, Gemini 2.0 Flash, and GPT-4o) perform similarly well for most tool calls, with GPT-4o slightly outperforming the Gemini models (0.82 vs. 0.77). The noticeable difference emerges primarily when partial text2cypher is used, particularly for various aggregation operations.

Note that this is only 22 fairly simple questions, so we didn't really explore the reasoning capabilities of the LLMs.

Additionally, I’ve seen projects where accuracy can be improved significantly by leveraging Python for aggregations, as LLMs typically handle Python code generation and execution better than generating complex Cypher queries directly.
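As a sketch of that alternative, the tool could return plain rows after the structured filtering step and let the LLM (or a sandboxed executor) aggregate them with pandas instead of generating Cypher. This is illustrative only and not something the benchmark above does:

import pandas as pd

# `rows` would be the filtered contracts returned by the base Cypher query,
# e.g. [{"contract_type": "Service", "total_amount": 12000.0}, ...]
def aggregate_by_type(rows):
    df = pd.DataFrame(rows)
    return df.groupby("contract_type")["total_amount"].agg(["count", "mean"])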

Web Application

We’ve also built a simple React web application, powered by LangGraph hosted on FastAPI, which streams responses directly to the frontend. Special thanks to Anej Gorkic for creating the web app.
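At its core, the backend just exposes a streaming endpoint. Below is a minimal, self-contained FastAPI sketch; the endpoint name and payload are assumptions, and in the real app the streamed tokens come from the LangGraph agent rather than a placeholder string:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
async def chat(question: str):
    async def token_stream():
        # Placeholder: stream the answer word by word to the frontend.
        for token in "This is a streamed placeholder answer".split():
            yield token + " "
    return StreamingResponse(token_stream(), media_type="text/plain")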

You can launch the entire stack with the following command:

docker compose up

And navigate to localhost:5173 

Summary

As LLMs gain stronger reasoning capabilities, they, when paired with the right tools, can become powerful agents for navigating complex domains like legal contracts. In this post, we’ve only scratched the surface, focusing on core contract attributes while barely touching the rich variety of clauses found in real-world agreements. There’s significant room for growth, from expanding clause coverage to refining tool design and interaction strategies.

The code is available on GitHub.

Images

All images in this post were created by the author.

The post Agentic GraphRAG for Commercial Contracts appeared first on Towards Data Science.

]]>
The Art of Noise https://towardsdatascience.com/the-art-of-noise/ Thu, 03 Apr 2025 01:12:22 +0000 https://towardsdatascience.com/?p=605395 Understanding and implementing a diffusion model from scratch with PyTorch

The post The Art of Noise appeared first on Towards Data Science.

]]>
Introduction

In my last several articles I talked about generative deep learning algorithms, which are mostly related to text generation tasks. So, I think it would be interesting to switch to generative algorithms for image generation now. There are plenty of deep learning models specialized for generating images out there, such as the Autoencoder, the Variational Autoencoder (VAE), the Generative Adversarial Network (GAN) and Neural Style Transfer (NST). I have actually published some of my writings about these topics on Medium as well; I provide the links at the end of this article if you want to read them.

In today’s article, I would like to discuss the so-called diffusion model — one of the most impactful models in the field of deep learning for image generation. The idea of this algorithm was first proposed in the paper titled Deep Unsupervised Learning using Nonequilibrium Thermodynamics written by Sohl-Dickstein et al. back in 2015 [1]. Their framework was then developed further by Ho et al. in 2020 in their paper titled Denoising Diffusion Probabilistic Models [2]. DDPM was later adapted by OpenAI and Google to develop DALL-E 2 and Imagen, both of which are known for their impressive ability to generate high-quality images.

How Diffusion Model Works

Generally speaking, a diffusion model works by generating an image from noise. We can think of it like an artist transforming a splash of paint on a canvas into a beautiful artwork. In order to do so, the diffusion model needs to be trained first. There are two main steps to follow to train the model, namely forward diffusion and backward diffusion.

Figure 1. The forward and backward diffusion process [3].

As you can see in the above figure, forward diffusion is a process where Gaussian noise is applied to the original image iteratively. We keep adding the noise until the image is completely unrecognizable, at which point we can say that the image now lies in the latent space. Different from Autoencoders and GANs where the latent space typically has a lower dimension than the original image, the latent space in DDPM maintains the exact same dimensionality as the original one. This noising process follows the principle of a Markov Chain, meaning that the image at timestep t is affected only by timestep t-1. Forward diffusion is considered easy since what we basically do is just adding some noise step by step.
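To make this concrete, a single noising step can be written in just a few lines. The snippet below is only an illustration with a made-up noise level; the proper schedule is implemented later in the NoiseScheduler class.

# One forward diffusion step: x_t = sqrt(1 - beta_t) * x_(t-1) + sqrt(beta_t) * eps
import math
import torch

beta_t = 0.0001                    # example noise level for a single timestep
x_prev = torch.randn(1, 28, 28)    # stand-in for the image at timestep t-1
eps    = torch.randn_like(x_prev)  # fresh Gaussian noise
x_t = math.sqrt(1. - beta_t) * x_prev + math.sqrt(beta_t) * eps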

The second training phase is called backward diffusion, where our objective is to remove the noise little by little until we obtain a clear image. This process follows the principle of the reverse Markov Chain, where the image at timestep t-1 can only be obtained based on the image at timestep t. Such a denoising process is really difficult since we need to guess which pixels are noise and which ones belong to the actual image content. Thus, we need to employ a neural network model to do so.

DDPM uses U-Net as the basis of the deep learning architecture for backward diffusion. However, instead of using the original U-Net model [4], we need to make several modifications to it so that it will be more suitable for our task. Later on, I am going to train this model on the MNIST Handwritten Digit dataset [5], and we will see whether it can generate similar images.

Well, that was pretty much all the fundamental concepts you need to know about diffusion models for now. In the next sections we are going to get even deeper into the details while implementing the algorithm from scratch.


PyTorch Implementation

We are going to start by importing the required modules. In case you’re not yet familiar with the imports below, both torch and torchvision are the libraries we’ll use for preparing the model and the dataset. Meanwhile, matplotlib and tqdm will help us display images and progress bars.

# Codeblock 1
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm

As the modules have been imported, the next thing to do is to initialize some config parameters. Look at the Codeblock 2 below for the details.

# Codeblock 2
IMAGE_SIZE     = 28     #(1)
NUM_CHANNELS   = 1      #(2)

BATCH_SIZE     = 2
NUM_EPOCHS     = 10
LEARNING_RATE  = 0.001

NUM_TIMESTEPS  = 1000   #(3)
BETA_START     = 0.0001 #(4)
BETA_END       = 0.02   #(5)
TIME_EMBED_DIM = 32     #(6)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")  #(7)
DEVICE

# Codeblock 2 Output
device(type='cuda')

At the lines marked with #(1) and #(2) I set IMAGE_SIZE and NUM_CHANNELS to 28 and 1, as these numbers are obtained from the image dimensions in the MNIST dataset. The BATCH_SIZE, NUM_EPOCHS, and LEARNING_RATE variables are pretty straightforward, so I don’t think I need to explain them further.

At line #(3), the variable NUM_TIMESTEPS denotes the number of iterations in the forward and backward diffusion process. Timestep 0 is the condition where the image is in its original state (the leftmost image in Figure 1). In this case, since we set this parameter to 1000, timestep number 999 is going to be the condition where the image is completely unrecognizable (the rightmost image in Figure 1). It is important to keep in mind that the choice of the number of timesteps involves a tradeoff between model accuracy and computational cost. If we assign a small value for NUM_TIMESTEPS, the inference time is going to be shorter, yet the resulting image might not be really good since the model has fewer steps to refine the image in the backward diffusion stage. On the other hand, increasing NUM_TIMESTEPS will slow down the inference process, but we can expect the output image to have better quality thanks to the gradual denoising process which results in a more precise reconstruction.

Next, the BETA_START (#(4)) and BETA_END (#(5)) variables are used to control the amount of Gaussian noise added at each timestep, whereas TIME_EMBED_DIM (#(6)) is employed to determine the feature vector length for storing the timestep information. Lastly, at line #(7) I assign “cuda” to the DEVICE variable if PyTorch detects a GPU installed in our machine. I highly recommend you run this project on a GPU since training a diffusion model is computationally expensive. Note that the values set for NUM_TIMESTEPS, BETA_START and BETA_END are all adopted directly from the DDPM paper [2].

The complete implementation will be done in several steps: constructing the U-Net model, preparing the dataset, defining noise scheduler for the diffusion process, training, and inference. We are going to discuss each of those stages in the following sub-sections.


The U-Net Architecture: Time Embedding

As I’ve mentioned earlier, the basis of a diffusion model is U-Net. This architecture is used because its output layer is suitable for representing an image, which makes sense since it was initially introduced for image segmentation in the first place. The following figure shows what the original U-Net architecture looks like.

Figure 2. The original U-Net model proposed in [4].

However, it is necessary to modify this architecture so that it can also take into account the timestep information. Not only that, since we will only use MNIST dataset, we also need to make the model smaller. Just remember the convention in deep learning that simpler models are often more effective for simple tasks.

In the figure below I show you the entire U-Net model that has been modified. Here you can see that the time embedding tensor is injected into the model at every stage, which will later be done by element-wise summation, allowing the model to capture the timestep information. Next, instead of repeating each of the downsampling and the upsampling stages four times like the original U-Net, in this case we will only repeat each of them twice. Additionally, it is worth noting that the stack of downsampling stages is also known as the encoder, whereas the stack of upsampling stages is often called the decoder.

Figure 3. The modified U-Net model for our diffusion task [3].

Now let’s start constructing the architecture by creating a class for generating the time embedding tensor, which the idea is similar to the positional embedding in Transformer. See the Codeblock 3 below for the details.

# Codeblock 3
class TimeEmbedding(nn.Module):
    def forward(self):
        time = torch.arange(NUM_TIMESTEPS, device=DEVICE).reshape(NUM_TIMESTEPS, 1)
        print(f"time\t\t: {time.shape}")
          
        i = torch.arange(0, TIME_EMBED_DIM, 2, device=DEVICE)
        denominator = torch.pow(10000, i/TIME_EMBED_DIM)
        print(f"denominator\t: {denominator.shape}")
          
        even_time_embed = torch.sin(time/denominator)  #(1)
        odd_time_embed  = torch.cos(time/denominator)  #(2)
        print(f"even_time_embed\t: {even_time_embed.shape}")
        print(f"odd_time_embed\t: {odd_time_embed.shape}")
          
        stacked = torch.stack([even_time_embed, odd_time_embed], dim=2)  #(3)
        print(f"stacked\t\t: {stacked.shape}")
        time_embed = torch.flatten(stacked, start_dim=1, end_dim=2)  #(4)
        print(f"time_embed\t: {time_embed.shape}")
          
        return time_embed

What we basically do in the above code is to create a tensor of size NUM_TIMESTEPS × TIME_EMBED_DIM (1000×32), where every single row of this tensor will contain the timestep information. Later on, each of the 1000 timesteps will be represented by a feature vector of length 32. The values in the tensor themselves are obtained based on the two equations in Figure 4. In the Codeblock 3 above, these two equations are implemented at line #(1) and #(2), each forming a tensor having the size of 1000×16. Next, these tensors are combined using the code at line #(3) and #(4).

Here I also print out every single step done in the above codeblock so that you can get a better understanding of what is actually being done in the TimeEmbedding class. If you still want more explanation about the above code, feel free to read my previous post about the Transformer, which you can access through the link at the end of this article. Once you’ve opened the link, just scroll all the way down to the Positional Encoding section.

Figure 4. The sinusoidal positional encoding formula from the Transformer paper [6].

Now let’s check if the TimeEmbedding class works properly using the following testing code. The resulting output shows that it successfully produced a tensor of size 1000×32, which is exactly what we expected earlier.

# Codeblock 4
time_embed_test = TimeEmbedding()
out_test = time_embed_test()

# Codeblock 4 Output
time            : torch.Size([1000, 1])
denominator     : torch.Size([16])
even_time_embed : torch.Size([1000, 16])
odd_time_embed  : torch.Size([1000, 16])
stacked         : torch.Size([1000, 16, 2])
time_embed      : torch.Size([1000, 32])

The U-Net Architecture: DoubleConv

If you take a closer look at the modified architecture, you will see that there are actually lots of repeating patterns, such as the ones highlighted in yellow boxes in the following figure.

Figure 5. The processes done inside the yellow boxes will be implemented in the DoubleConv class [3].

These five yellow boxes share the same structure, where they consist of two convolution layers with the time embedding tensor injected right after the first convolution operation is performed. So, what we are going to do now is to create another class named DoubleConv to reproduce this structure. Look at the Codeblock 5a and 5b below to see how I do that.

# Codeblock 5a
class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()
        
        self.conv_0 = nn.Conv2d(in_channels=in_channels,  #(2)
                                out_channels=out_channels, 
                                kernel_size=3, 
                                bias=False, 
                                padding=1)
        self.bn_0 = nn.BatchNorm2d(num_features=out_channels)  #(3)
        
        self.time_embedding = TimeEmbedding()  #(4)
        self.linear = nn.Linear(in_features=TIME_EMBED_DIM,  #(5)
                                out_features=out_channels)
        
        self.conv_1 = nn.Conv2d(in_channels=out_channels,  #(6)
                                out_channels=out_channels, 
                                kernel_size=3, 
                                bias=False, 
                                padding=1)
        self.bn_1 = nn.BatchNorm2d(num_features=out_channels)  #(7)
        
        self.relu = nn.ReLU(inplace=True)  #(8)

The two inputs of the __init__() method above give us the flexibility to configure the number of input and output channels (#(1)) so that the DoubleConv class can be used to instantiate all the five yellow boxes simply by adjusting its input arguments. As the name suggests, here we initialize two convolution layers (line #(2) and #(6)), each followed by a batch normalization layer and a ReLU activation function. Keep in mind that the two normalization layers need to be initialized separately (line #(3) and #(7)) since each of them has its own trainable normalization parameters. Meanwhile, the ReLU activation function should only be initialized once (#(8)) because it contains no parameters, allowing it to be used multiple times in different parts of the network. At line #(4), we initialize the TimeEmbedding layer we created earlier, which will later be connected to a standard linear layer (#(5)). This linear layer is responsible for adjusting the dimension of the time embedding tensor so that the resulting output can be summed with the output from the first convolution layer in an element-wise manner.

Now let’s take a look at the Codeblock 5b below to better understand the flow of the DoubleConv block. Here you can see that the forward() method accepts two inputs: the raw image x and the timestep information t as shown at line #(1). We initially process the image with the first Conv-BN-ReLU sequence (#(2–4)). This Conv-BN-ReLU structure is typically used when working with CNN-based models, even if the illustration does not explicitly show the batch normalization and the ReLU layers. Apart from the image, we then take the t-th timestep information from our embedding tensor of the corresponding image (#(5)) and pass it through the linear layer (#(6)). We still need to expand the dimension of the resulting tensor using the code at line #(7) before performing element-wise summation at line #(8). Finally, we process the resulting tensor with the second Conv-BN-ReLU sequence (#(9–11)).

# Codeblock 5b
    def forward(self, x, t):  #(1)
        print(f'images\t\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        
        x = self.conv_0(x)  #(2)
        x = self.bn_0(x)    #(3)
        x = self.relu(x)    #(4)
        print(f'\nafter first conv\t: {x.size()}')
        
        time_embed = self.time_embedding()[t]      #(5)
        print(f'\ntime_embed\t\t: {time_embed.size()}')
        
        time_embed = self.linear(time_embed)       #(6)
        print(f'time_embed after linear\t: {time_embed.size()}')
        
        time_embed = time_embed[:, :, None, None]  #(7)
        print(f'time_embed expanded\t: {time_embed.size()}')
        
        x = x + time_embed  #(8)
        print(f'\nafter summation\t\t: {x.size()}')
        
        x = self.conv_1(x)  #(9)
        x = self.bn_1(x)    #(10)
        x = self.relu(x)    #(11)
        print(f'after second conv\t: {x.size()}')
        
        return x

To see if our DoubleConv implementation works properly, we are going to test it with the Codeblock 6 below. Here I want to simulate the very first instance of this block, which corresponds to the leftmost yellow box in Figure 5. To do so, we need to set the in_channels and out_channels parameters to 1 and 64, respectively (#(1)). Next, we initialize two input tensors, namely x_test and t_test. The x_test tensor has the size of 2×1×28×28, representing a batch of two grayscale images having the size of 28×28 (#(2)). Keep in mind that this is just a dummy tensor of random values which will be replaced with the actual images from the MNIST dataset later in the training phase. Meanwhile, t_test is a tensor containing the timestep numbers of the corresponding images (#(3)). The values for this tensor are randomly selected between 0 and NUM_TIMESTEPS (1000). Note that the datatype of this tensor must be an integer since the numbers will be used for indexing, as shown at line #(5) back in Codeblock 5b. Lastly, at line #(4) we pass both x_test and t_test tensors to the double_conv_test layer.

By the way, I re-run the previous codeblocks with the print() functions removed prior to running the following code so that the outputs will look neater.

# Codeblock 6
double_conv_test = DoubleConv(in_channels=1, out_channels=64).to(DEVICE)  #(1)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)  #(2)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)  #(3)

out_test = double_conv_test(x_test, t_test)  #(4)

# Codeblock 6 Output
images                  : torch.Size([2, 1, 28, 28])   #(1)
timesteps               : torch.Size([2]), tensor([468, 304], device='cuda:0')  #(2)

after first conv        : torch.Size([2, 64, 28, 28])  #(3)

time_embed              : torch.Size([2, 32])          #(4)
time_embed after linear : torch.Size([2, 64])
time_embed expanded     : torch.Size([2, 64, 1, 1])    #(5)

after summation         : torch.Size([2, 64, 28, 28])  #(6)
after second conv       : torch.Size([2, 64, 28, 28])  #(7)

The shape of our original input tensors can be seen at lines #(1) and #(2) in the above output. Specifically at line #(2), I also print out the two timesteps that we selected randomly. In this example we assume that each of the two images in the x tensor are already noised with the noise level from 468-th and 304-th timesteps prior to being fed into the network. We can see that the shape of the image tensor x changes to 2×64×28×28 after being passed through the first convolution layer (#(3)). Meanwhile, the size of our time embedding tensor becomes 2×32 (#(4)), which is obtained by extracting rows 468 and 304 from the original embedding of size 1000×32. In order to allow element-wise summation to be performed (#(6)), we need to map the 32-dimensional time embedding vectors into 64 and expand their axes, resulting in a tensor of size 2×64×1×1 (#(5)) so that it can be broadcast to the 2×64×28×28 tensor. After the summation is done, we then pass the tensor through the second convolution layer, at which point the tensor dimension does not change at all (#(7)).


The U-Net Architecture: Encoder

As we have successfully implemented the DoubleConv block, the next step to do is to implement the so-called DownSample block. In Figure 6 below, this corresponds to the parts enclosed in the red box.

Figure 6. The parts of the network highlighted in red are the so-called DownSample blocks [3].

The purpose of a DownSample block is to reduce the spatial dimension of an image, but it is important to note that at the same time it increases the number of channels. In order to achieve this, we can simply stack a DoubleConv block and a maxpooling operation. In this case the pooling uses 2×2 kernel size with the stride of 2, causing the spatial dimension of the image to be twice as small as the input. The implementation of this block can be seen in Codeblock 7 below.

# Codeblock 7
class DownSample(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()
        
        self.double_conv = DoubleConv(in_channels=in_channels,  #(2)
                                      out_channels=out_channels)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)    #(3)
    
    def forward(self, x, t):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        
        convolved = self.double_conv(x, t)   #(5)
        print(f'\nafter double conv\t: {convolved.size()}')
        
        maxpooled = self.maxpool(convolved)  #(6)
        print(f'after pooling\t\t: {maxpooled.size()}')
        
        return convolved, maxpooled          #(7)

Here I set the __init__() method to take the number of input and output channels so that we can use it for creating the two DownSample blocks highlighted in Figure 6 without needing to write them in separate classes (#(1)). Next, the DoubleConv and the maxpooling layers are initialized at line #(2) and #(3), respectively. Remember that since the DoubleConv block accepts image x and the corresponding timestep t as the inputs, we also need to set the forward() method of this DownSample block such that it accepts both of them as well (#(4)). The information contained in x and t is then combined as the two tensors are processed by the double_conv layer, and the output is stored in the variable named convolved (#(5)). Afterwards, we actually perform the downsampling with the maxpooling operation at line #(6), producing a tensor named maxpooled. It is important to note that both the convolved and maxpooled tensors are going to be returned, which is essentially done because we will later bring maxpooled to the next downsampling stage, whereas the convolved tensor will be transferred directly to the upsampling stage in the decoder through skip-connections.

Now let’s test the DownSample class using the Codeblock 8 below. The input tensors used here are exactly the same as the ones in Codeblock 6. Based on the resulting output, we can see that the pooling operation successfully converted the output of the DoubleConv block from 2×64×28×28 (#(1)) to 2×64×14×14 (#(2)), indicating that our DownSample class works properly.

# Codeblock 8
down_sample_test = DownSample(in_channels=1, out_channels=64).to(DEVICE)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)

out_test = down_sample_test(x_test, t_test)

# Codeblock 8 Output
original          : torch.Size([2, 1, 28, 28])
timesteps         : torch.Size([2]), tensor([468, 304], device='cuda:0')

after double conv : torch.Size([2, 64, 28, 28])  #(1)
after pooling     : torch.Size([2, 64, 14, 14])  #(2)

The U-Net Architecture: Decoder

We need to introduce the so-called UpSample block in the decoder, which is responsible for reverting the tensor in the intermediate layers to the original image dimension. In order to maintain a symmetrical structure, the number of UpSample blocks must match that of the DownSample blocks. Look at the Figure 7 below to see where the two UpSample blocks are placed.

Figure 7. The components inside the blue boxes are the so-called UpSample blocks [3].

Since both UpSample blocks are structurally identical, we can just initialize a single class for them, just like the DownSample class we created earlier. Look at the Codeblock 9 below to see how I implement it.

# Codeblock 9
class UpSample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        
        self.conv_transpose = nn.ConvTranspose2d(in_channels=in_channels,  #(1)
                                                 out_channels=out_channels, 
                                                 kernel_size=2, stride=2)  #(2)
        self.double_conv = DoubleConv(in_channels=in_channels,  #(3)
                                      out_channels=out_channels)
        
    def forward(self, x, t, connection):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        print(f'connection\t\t: {connection.size()}')
        
        x = self.conv_transpose(x)  #(5)
        print(f'\nafter conv transpose\t: {x.size()}')
        
        x = torch.cat([x, connection], dim=1)  #(6)
        print(f'after concat\t\t: {x.size()}')
        
        x = self.double_conv(x, t)  #(7)
        print(f'after double conv\t: {x.size()}')
        
        return x

In the __init__() method, we use nn.ConvTranspose2d to upsample the spatial dimension (#(1)). Both the kernel size and stride are set to 2 so that the output will be twice as large (#(2)). Next, the DoubleConv block will be employed to reduce the number of channels, while at the same time combining the timestep information from the time embedding tensor (#(3)).

The flow of this UpSample class is a bit more complicated than the DownSample class. If we take a closer look at the architecture, we’ll see that we also have a skip-connection coming directly from the encoder. Thus, we need the forward() method to accept another argument in addition to the original image x and the timestep t, namely the residual tensor connection (#(4)). The first thing we do inside this method is to process the original image x with the transpose convolution layer (#(5)). In fact, this layer not only upsamples the spatial size but also reduces the number of channels at the same time. However, the resulting tensor is then directly concatenated with connection in a channel-wise manner (#(6)), causing it to seem like no channel reduction is performed. It is important to know that at this point these two tensors are just concatenated, meaning that the information from the two is not yet combined. We finally feed these concatenated tensors to the double_conv layer (#(7)), allowing them to share information with each other through the learnable parameters inside the convolution layers.

The Codeblock 10 below shows how I test the UpSample class. The size of the tensors to be passed through are set according to the second upsampling block, i.e., the rightmost blue box in Figure 7.

# Codeblock 10
up_sample_test = UpSample(in_channels=128, out_channels=64).to(DEVICE)

x_test = torch.randn((BATCH_SIZE, 128, 14, 14)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)
connection_test = torch.randn((BATCH_SIZE, 64, 28, 28)).to(DEVICE)

out_test = up_sample_test(x_test, t_test, connection_test)

In the resulting output below, if we compare the input tensor (#(1)) with the final tensor shape (#(2)), we can clearly see that the number of channels successfully reduced from 128 to 64, while at the same time the spatial dimension increased from 14×14 to 28×28. This essentially means that our UpSample class is now ready to be used in the main U-Net architecture.

# Codeblock 10 Output
original             : torch.Size([2, 128, 14, 14])   #(1)
timesteps            : torch.Size([2]), tensor([468, 304], device='cuda:0')
connection           : torch.Size([2, 64, 28, 28])

after conv transpose : torch.Size([2, 64, 28, 28])
after concat         : torch.Size([2, 128, 28, 28])
after double conv    : torch.Size([2, 64, 28, 28])    #(2)

The U-Net Architecture: Putting All Components Together

Once all U-Net components have been created, what we are going to do next is to wrap them together into a single class. Look at the Codeblock 11a and 11b below for the details.

# Codeblock 11a
class UNet(nn.Module):
    def __init__(self):
        super().__init__()
      
        self.downsample_0 = DownSample(in_channels=NUM_CHANNELS,  #(1)
                                       out_channels=64)
        self.downsample_1 = DownSample(in_channels=64,            #(2)
                                       out_channels=128)
      
        self.bottleneck   = DoubleConv(in_channels=128,           #(3)
                                       out_channels=256)
      
        self.upsample_0   = UpSample(in_channels=256,             #(4)
                                     out_channels=128)
        self.upsample_1   = UpSample(in_channels=128,             #(5)
                                     out_channels=64)
      
        self.output = nn.Conv2d(in_channels=64,                   #(6)
                                out_channels=NUM_CHANNELS,
                                kernel_size=1)

You can see in the __init__() method above that we initialize two downsampling (#(1–2)) and two upsampling (#(4–5)) blocks, whose numbers of input and output channels are set according to the architecture shown in the illustration. There are actually two additional components I haven’t explained yet, namely the bottleneck (#(3)) and the output layer (#(6)). The former is essentially just a DoubleConv block, which acts as the main connection between the encoder and the decoder. Look at the Figure 8 below to see which components of the network belong to the bottleneck layer. Next, the output layer is a standard convolution layer which is responsible for turning the 64-channel image produced by the last UpSampling stage into a 1-channel one. This operation is done using a kernel of size 1×1, meaning that it combines information across all channels while operating independently at each pixel position.

Figure 8. The bottleneck layer (the lower part of the model) acts as the main bridge between the encoder and the decoder of U-Net [3].
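As a quick sanity check of what this 1×1 convolution does to the tensor shape, the standalone snippet below (not part of the model itself) maps a 64-channel feature map back to a single channel while leaving the spatial dimensions untouched.

conv_1x1_test = nn.Conv2d(in_channels=64, out_channels=NUM_CHANNELS, kernel_size=1).to(DEVICE)
print(conv_1x1_test(torch.randn(BATCH_SIZE, 64, IMAGE_SIZE, IMAGE_SIZE).to(DEVICE)).shape)
# torch.Size([2, 1, 28, 28])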

I guess the forward() method of the entire U-Net in the following codeblock is pretty straightforward, as what we essentially do here is pass the tensors from one layer to another — just don’t forget to include the skip connections between the downsampling and upsampling blocks.

# Codeblock 11b
    def forward(self, x, t):  #(1)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
            
        convolved_0, maxpooled_0 = self.downsample_0(x, t)
        print(f'\nmaxpooled_0\t\t: {maxpooled_0.size()}')
            
        convolved_1, maxpooled_1 = self.downsample_1(maxpooled_0, t)
        print(f'maxpooled_1\t\t: {maxpooled_1.size()}')
            
        x = self.bottleneck(maxpooled_1, t)
        print(f'after bottleneck\t: {x.size()}')
    
        upsampled_0 = self.upsample_0(x, t, convolved_1)
        print(f'upsampled_0\t\t: {upsampled_0.size()}')
            
        upsampled_1 = self.upsample_1(upsampled_0, t, convolved_0)
        print(f'upsampled_1\t\t: {upsampled_1.size()}')
            
        x = self.output(upsampled_1)
        print(f'final output\t\t: {x.size()}')
            
        return x

Now let’s see whether we have correctly constructed the U-Net class above by running the following testing code.

# Codeblock 12
unet_test = UNet().to(DEVICE)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)

out_test = unet_test(x_test, t_test)

# Codeblock 12 Output
original         : torch.Size([2, 1, 28, 28])   #(1)
timesteps        : torch.Size([2]), tensor([468, 304], device='cuda:0')

maxpooled_0      : torch.Size([2, 64, 14, 14])  #(2)
maxpooled_1      : torch.Size([2, 128, 7, 7])   #(3)
after bottleneck : torch.Size([2, 256, 7, 7])   #(4)
upsampled_0      : torch.Size([2, 128, 14, 14])
upsampled_1      : torch.Size([2, 64, 28, 28])
final output     : torch.Size([2, 1, 28, 28])   #(5)

We can see in the above output that the two downsampling stages successfully converted the original tensor of size 1×28×28 (#(1)) into 64×14×14 (#(2)) and 128×7×7 (#(3)), respectively. This tensor is then passed through the bottleneck layer, causing its number of channels to expand to 256 without changing the spatial dimension (#(4)). Lastly, we upsample the tensor twice before eventually shrinking the number of channels to 1 (#(5)). Based on this output, it looks like our model is working properly. Thus, it is now ready to be trained for our diffusion task.


Dataset Preparation

As we have successfully created the entire U-Net architecture, the next thing to do is to prepare the MNIST Handwritten Digit dataset. Before actually loading it, we need to define the preprocessing steps first using the transforms.Compose() method from Torchvision, as shown at line #(1) in Codeblock 13. There are two things we do here: converting the images into PyTorch tensors which also scales the pixel values from 0–255 to 0–1 (#(2)), and normalize them so that the final pixel values ranging between -1 and 1 (#(3)). Next, we download the dataset using datasets.MNIST(). In this case, we are going to take the images from the training data, hence we use train=True (#(5)). Don’t forget to pass the transform variable we initialized earlier to the transform parameter (transform=transform) so that it will automatically preprocess the images as we load them (#(6)). Lastly, we need to employ DataLoader to load the images from mnist_dataset (#(7)). The arguments I use for the input parameters are intended to randomly pick BATCH_SIZE (2) images from the dataset in each iteration.

# Codeblock 13
transform = transforms.Compose([  #(1)
    transforms.ToTensor(),        #(2)
    transforms.Normalize((0.5,), (0.5,))  #(3)
])

mnist_dataset = datasets.MNIST(   #(4)
    root='./data', 
    train=True,           #(5)
    download=True, 
    transform=transform   #(6)
)

loader = DataLoader(mnist_dataset,  #(7)
                    batch_size=BATCH_SIZE,
                    drop_last=True, 
                    shuffle=True)

In the following codeblock, I try to load a batch of images from the dataset. In every iteration, loader provides both the images and the corresponding labels, hence we need to store them in two separate variables: images and labels.

# Codeblock 14
images, labels = next(iter(loader))

print('images\t\t:', images.shape)
print('labels\t\t:', labels.shape)
print('min value\t:', images.min())
print('max value\t:', images.max())

We can see in the resulting output below that the images tensor has the size of 2×1×28×28 (#(1)), indicating that two grayscale images of size 28×28 have been successfully loaded. Here we can also see that the length of the labels tensor is 2, which matches the number of the loaded images (#(2)). Note that in this case the labels are going to be completely ignored. My plan here is that I just want the model to generate any number it has previously seen from the entire training dataset without even knowing what number it actually is. Lastly, this output also shows that the preprocessing works properly, as the pixel values now range between -1 and 1.

# Codeblock 14 Output
images    : torch.Size([2, 1, 28, 28])  #(1)
labels    : torch.Size([2])             #(2)
min value : tensor(-1.)
max value : tensor(1.)
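Since the pixel values now live in the [-1, 1] range, we can map them back to [0, 1] whenever we need standard image values, for example when saving generated samples. This small helper is just the inverse of the normalization above (matplotlib rescales grayscale images automatically, so the plotting code below works either way):

def denormalize(img):
    # Inverse of transforms.Normalize((0.5,), (0.5,)): maps [-1, 1] back to [0, 1]
    return img * 0.5 + 0.5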

Run the following code if you want to see what the image we just loaded looks like.

# Codeblock 15   
plt.imshow(images[0].squeeze(), cmap='gray')
plt.show()
Figure 9. Output from Codeblock 15 [3].

Noise Scheduler

In this section we are going to talk about how the forward and backward diffusion are performed, a process which essentially involves adding or removing noise little by little at each timestep. It is necessary to know that we basically want a uniform amount of noise across all timesteps: in the forward diffusion the image should be completely full of noise at the final timestep, while in the backward diffusion we have to get a completely clear image at timestep 0. Hence, we need something to control the noise amount for each timestep. Later in this section, I am going to implement a class named NoiseScheduler to do so. This will probably be the most math-heavy section of this article, as I’ll display many equations here. But don’t worry about that, since we’ll focus on implementing these equations rather than discussing the mathematical derivations.

Now let’s take a look at the equations in Figure 10 which I will implement in the __init__() method of the NoiseScheduler class below.

Figure 10. The equations we need to implement in the __init__() method of the <strong>NoiseScheduler</strong> class [3].
# Codeblock 16a
class NoiseScheduler:
    def __init__(self):
        self.betas = torch.linspace(BETA_START, BETA_END, NUM_TIMESTEPS)  #(1)
        self.alphas = 1. - self.betas
        self.alphas_cum_prod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cum_prod = torch.sqrt(self.alphas_cum_prod)
        self.sqrt_one_minus_alphas_cum_prod = torch.sqrt(1. - self.alphas_cum_prod)

The above code works by creating multiple sequences of numbers, all of which are basically controlled by BETA_START (0.0001), BETA_END (0.02), and NUM_TIMESTEPS (1000). The first sequence we need to instantiate is the betas itself, which is done using torch.linspace() (#(1)). What it essentially does is generate a 1-dimensional tensor of length 1000 starting from 0.0001 to 0.02, where every single element in this tensor corresponds to a single timestep. The interval between each element is uniform, allowing us to generate a uniform amount of noise throughout all timesteps as well. With this betas tensor, we then compute alphas, alphas_cum_prod, sqrt_alphas_cum_prod and sqrt_one_minus_alphas_cum_prod based on the four equations in Figure 10. Later on, these tensors will act as the basis of how the noise is generated or removed during the diffusion process.

Diffusion is normally done in a sequential manner. However, the forward diffusion process is deterministic, hence we can derive the original equation into a closed form so that we can obtain the noise at a specific timestep without having to iteratively add noise from the very beginning. The Figure 11 below shows what the closed form of the forward diffusion looks like, where x₀ represents the original image while epsilon (ϵ) denotes an image made up of random Gaussian noise. We can think of this equation as a weighted combination, where we combine the clear image and the noise according to weights determined by the timestep, resulting in an image with a specific amount of noise.

Figure 11. The closed form of the forward diffusion process [3].

The implementation of this equation can be seen in Codeblock 16b. In this forward_diffusion() method, x₀ and ϵ are denoted as original and noise. Here you need to keep in mind that these two input variables are images, whereas sqrt_alphas_cum_prod_t and sqrt_one_minus_alphas_cum_prod_t are scalars. Thus, we need to adjust the shape of these two scalars (#(1) and #(2)) so that the operation at line #(3) can be performed. The noisy_image variable is going to be the output of this function, and I guess its name is self-explanatory.

# Codeblock 16b
    def forward_diffusion(self, original, noise, t):
        sqrt_alphas_cum_prod_t = self.sqrt_alphas_cum_prod[t]
        sqrt_alphas_cum_prod_t = sqrt_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1)  #(1)
        
        sqrt_one_minus_alphas_cum_prod_t = self.sqrt_one_minus_alphas_cum_prod[t]
        sqrt_one_minus_alphas_cum_prod_t = sqrt_one_minus_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1)  #(2)
        
        noisy_image = sqrt_alphas_cum_prod_t * original + sqrt_one_minus_alphas_cum_prod_t * noise  #(3)
        
        return noisy_image

Now let’s talk about backward diffusion. In fact, this one is a bit more complicated than the forward diffusion since we need three more equations here. Before I give you these equations, let me show you the implementation first. See the Codeblock 16c below.

# Codeblock 16c
    def backward_diffusion(self, current_image, predicted_noise, t):  #(1)
        denoised_image = (current_image - (self.sqrt_one_minus_alphas_cum_prod[t] * predicted_noise)) / self.sqrt_alphas_cum_prod[t]  #(2)
        denoised_image = 2 * (denoised_image - denoised_image.min()) / (denoised_image.max() - denoised_image.min()) - 1  #(3)
        
        current_prediction = current_image - ((self.betas[t] * predicted_noise) / (self.sqrt_one_minus_alphas_cum_prod[t]))  #(4)
        current_prediction = current_prediction / torch.sqrt(self.alphas[t])  #(5)
        
        if t == 0:  #(6)
            return current_prediction, denoised_image
        
        else:
            variance = (1 - self.alphas_cum_prod[t-1]) / (1. - self.alphas_cum_prod[t])  #(7)
            variance = variance * self.betas[t]  #(8)
            sigma = variance ** 0.5
            z = torch.randn(current_image.shape).to(DEVICE)
            current_prediction = current_prediction + sigma*z
            
            return current_prediction, denoised_image

Later in the inference phase, the backward_diffusion() method will be called inside a loop that iterates NUM_TIMESTEPS (1000) times, starting from t = 999, continuing with t = 998, and so on all the way down to t = 0. This function is responsible for removing the noise from the image iteratively based on the current_image (the image produced by the previous denoising step), the predicted_noise (the noise predicted by the U-Net in the previous step), and the timestep information t (#(1)). In each iteration, noise removal is done using the equation shown in Figure 12, which corresponds to lines #(4–5) in Codeblock 16c.

Figure 12. The equation used for removing noise from the image [3].

As long as we haven’t reached t = 0, we will compute the variance based on the equation in Figure 13 (#(7–8)). This variance will then be used to introduce additional controlled noise to simulate the stochasticity in the backward diffusion process, since the noise removal equation in Figure 12 is a deterministic approximation. This is essentially also the reason that we don’t calculate the variance once we reach t = 0 (#(6)), since we no longer need to add more noise as the image is already completely clear.

Figure 13. The equation used to calculate variance for introducing controlled noise [3].

Different from current_prediction which aims to estimate the image of the previous timestep (xₜ₋₁), the objective of the denoised_image tensor is to reconstruct the original image (x₀). Thanks to these different objectives, we need a separate equation to compute denoised_image, which can be seen in Figure 14 below. The implementation of the equation itself is written at line #(2–3).

Figure 14. The equation for reconstructing the original image [3].

Now let’s test the NoiseScheduler class we created above. In the following codeblock, I instantiate a NoiseScheduler object and print out the attributes associated with it, which are all computed using the equation in Figure 10 based on the values stored in the betas attribute. Remember that the actual length of these tensors is NUM_TIMESTEPS (1000), but here I only print out the first 6 elements.

# Codeblock 17
noise_scheduler = NoiseScheduler()

print(f'betas\t\t\t\t: {noise_scheduler.betas[:6]}')
print(f'alphas\t\t\t\t: {noise_scheduler.alphas[:6]}')
print(f'alphas_cum_prod\t\t\t: {noise_scheduler.alphas_cum_prod[:6]}')
print(f'sqrt_alphas_cum_prod\t\t: {noise_scheduler.sqrt_alphas_cum_prod[:6]}')
print(f'sqrt_one_minus_alphas_cum_prod\t: {noise_scheduler.sqrt_one_minus_alphas_cum_prod[:6]}')

# Codeblock 17 Output
betas                          : tensor([1.0000e-04, 1.1992e-04, 1.3984e-04, 1.5976e-04, 1.7968e-04, 1.9960e-04])
alphas                         : tensor([0.9999, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998])
alphas_cum_prod                : tensor([0.9999, 0.9998, 0.9996, 0.9995, 0.9993, 0.9991])
sqrt_alphas_cum_prod           : tensor([0.9999, 0.9999, 0.9998, 0.9997, 0.9997, 0.9996])
sqrt_one_minus_alphas_cum_prod : tensor([0.0100, 0.0148, 0.0190, 0.0228, 0.0264, 0.0300])

The above output indicates that our __init__() method works as expected. Next, we are going to test the forward_diffusion() method. If you go back to Codeblock 16b, you will see that forward_diffusion() accepts three inputs: the original image, the noise image, and the timestep number. Let’s just use an image from the MNIST dataset we loaded earlier for the first input (#(1)) and random Gaussian noise of the exact same size for the second one (#(2)). Run Codeblock 18 below to see what these two images look like.

# Codeblock 18
image = images[0]  #(1)
noise = torch.randn_like(image)  #(2)

plt.imshow(image.squeeze(), cmap='gray')
plt.show()
plt.imshow(noise.squeeze(), cmap='gray')
plt.show()
Figure 15. The two images to be used as the original (left) and the noise image (right). The one on the left is the same image I showed earlier in Figure 9 [3].

Now that the image and the noise are ready, all we need to do is pass them to the forward_diffusion() method alongside t. I actually ran Codeblock 19 below multiple times with t = 50, 100, 150, and so on up to t = 300. You can see in Figure 16 that the image becomes less clear as t increases; at t = 999 it would be completely filled with noise. (A small sketch that automates this sweep follows Figure 16.)

# Codeblock 19
noisy_image_test = noise_scheduler.forward_diffusion(image.to(DEVICE), noise.to(DEVICE), t=50)

plt.imshow(noisy_image_test[0].squeeze().cpu(), cmap='gray')
plt.show()
Figure 16. The result of the forward diffusion process at t=50, 100, 150, and so on until t=300 [3].
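If you prefer not to re-run Codeblock 19 by hand, the following is a minimal sketch (not part of the original codeblocks) that produces a Figure 16-style grid in one go, reusing the noise_scheduler, image, and noise objects defined above.

# A sketch that sweeps over several timesteps and plots the results side by side.
fig, axes = plt.subplots(ncols=6, figsize=(18, 4))

for ax, t in zip(axes, [50, 100, 150, 200, 250, 300]):
    noisy = noise_scheduler.forward_diffusion(image.to(DEVICE), noise.to(DEVICE), t=t)
    ax.imshow(noisy.squeeze().cpu(), cmap='gray')
    ax.set_title(f't = {t}')
    ax.axis('off')

plt.show()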

Unfortunately, we cannot properly test the backward_diffusion() method yet since it requires a trained U-Net model, so I’ll show you how this function is actually used later in the inference phase. That said, a quick shape-only sanity check is still possible; see the sketch below.
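This is a minimal sketch, not a test of generation quality: it feeds random tensors in place of a real image and the U-Net’s predicted noise, purely to confirm that the method executes and returns tensors of the expected shape.

# Shape-only sanity check for backward_diffusion() using random inputs.
dummy_image = torch.randn((1, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
dummy_noise = torch.randn((1, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)

dummy_prediction, dummy_denoised = noise_scheduler.backward_diffusion(dummy_image, dummy_noise, t=torch.as_tensor(500))
print(dummy_prediction.shape)  # expected: same shape as dummy_image
print(dummy_denoised.shape)    # expected: same shape as dummy_image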


Training

Now that the U-Net model, the MNIST dataset, and the noise scheduler are ready, we can prepare a function for training. Before doing that, I instantiate the model and the noise scheduler in Codeblock 20 below.

# Codeblock 20
model = UNet().to(DEVICE)
noise_scheduler = NoiseScheduler()

The entire training procedure is implemented in the train() function shown in Codeblock 21. Before doing anything else, we initialize the optimizer and the loss function, which in this case are Adam and MSE, respectively (#(1–2)). What we basically want to do here is train the model so that it can predict the noise contained in the input image; later on, this predicted noise will be used as the basis of the denoising process in the backward diffusion stage. To train the model, we first perform forward diffusion using the code at line #(6). This noising process is applied to the images tensor (#(3)) using the random noise generated at line #(4). Next, we draw a random integer between 0 and NUM_TIMESTEPS (1000) for t (#(5)); this is done because we want our model to see images of varying noise levels, which helps it generalize. Once the noisy images have been generated, we pass them through the U-Net model alongside the chosen t (#(7)). The input t is useful for the model because it indicates the current noise level in the image. Lastly, the loss function we initialized earlier computes the difference between the actual noise and the noise predicted by the model (#(8)). So, the objective of this training is basically to make the predicted noise as similar as possible to the noise we generated at line #(4).

# Codeblock 21
def train():
    optimizer = Adam(model.parameters(), lr=LEARNING_RATE)  #(1)
    loss_function = nn.MSELoss()  #(2)
    losses = []
    
    for epoch in range(NUM_EPOCHS):
        print(f'Epoch no {epoch}')
        
        for images, _ in tqdm(loader):
            
            optimizer.zero_grad()

            images = images.float().to(DEVICE)  #(3)
            noise = torch.randn_like(images)  #(4)
            t = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,))  #(5)

            noisy_images = noise_scheduler.forward_diffusion(images, noise, t).to(DEVICE)  #(6)
            predicted_noise = model(noisy_images, t)  #(7)
            loss = loss_function(predicted_noise, noise)  #(8)
            
            losses.append(loss.item())
            loss.backward()
            optimizer.step()

    return losses

Now let’s run the training function using the codeblock below. Sit back and relax while waiting for the training to complete. In my case, I used a Kaggle Notebook with an Nvidia P100 GPU enabled, and it took around 45 minutes to finish.

# Codeblock 22
losses = train()

If we take a look at the loss graph, it seems like our model learned pretty well: the value generally decreases over time, with a rapid drop in the early stages and a more stable (yet still decreasing) trend later on. So, I think we can expect good results in the inference phase.

# Codeblock 23
plt.plot(losses)
Figure 17. How the loss value decreases as the training goes [3].
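Note that the raw per-batch loss can look quite jagged. If you want a cleaner curve, one option is to plot a simple moving average instead; below is a minimal sketch, where the window size of 100 is an arbitrary choice.

# Smooth the per-batch losses with a moving average before plotting.
import numpy as np

window = 100  # arbitrary smoothing window
smoothed_losses = np.convolve(losses, np.ones(window) / window, mode='valid')

plt.plot(smoothed_losses)
plt.xlabel('batch')
plt.ylabel('smoothed MSE loss')
plt.show()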

Inference

At this point we have already got our model trained, so we can now perform inference on it. Look at the Codeblock 24 below to see how I implement the inference() function.

# Codeblock 24
def inference():

    denoised_images = []  #(1)
    
    with torch.no_grad():  #(2)
        current_prediction = torch.randn((64, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)  #(3)
        
        for i in tqdm(reversed(range(NUM_TIMESTEPS))):  #(4)
            predicted_noise = model(current_prediction, torch.as_tensor(i).unsqueeze(0))  #(5)
            current_prediction, denoised_image = noise_scheduler.backward_diffusion(current_prediction, predicted_noise, torch.as_tensor(i))  #(6)

            if i%100 == 0:  #(7)
                denoised_images.append(denoised_image)
            
        return denoised_images

At the line marked with #(1) I initialize an empty list that will be used to store the denoising result every 100 timesteps (#(7)); this will later allow us to see how the backward diffusion progresses. The actual inference process is wrapped inside torch.no_grad() (#(2)). Remember that in diffusion models we generate images from completely random noise, which we assume to be at t = 999. To implement this, we can simply use torch.randn() as shown at line #(3). Here we initialize a tensor of size 64×1×28×28, indicating that we are about to generate 64 images simultaneously. Next, we write a for loop that iterates backwards from 999 to 0 (#(4)). Inside this loop, we feed the current image and the timestep as input to the trained U-Net and let it predict the noise (#(5)). The actual backward diffusion is then performed at line #(6). By the end of the loop, we should get new images similar to the ones in our dataset. Now let’s call the inference() function in the following codeblock.

# Codeblock 25
denoised_images = inference()

Now that the inference is complete, we can see what the resulting images look like. Codeblock 26 below displays the first 42 images we just generated.

# Codeblock 26
fig, axes = plt.subplots(ncols=7, nrows=6, figsize=(10, 8))

counter = 0

for i in range(6):
    for j in range(7):
        axes[i,j].imshow(denoised_images[-1][counter].squeeze().detach().cpu().numpy(), cmap='gray')  #(1)
        axes[i,j].get_xaxis().set_visible(False)
        axes[i,j].get_yaxis().set_visible(False)
        counter += 1

plt.show()
Figure 18. The images generated by the diffusion model trained on the MNIST Handwritten Digit dataset [3].

If you take a look at the above codeblock, you can see that the [-1] index at line #(1) means we only display images from the last iteration (which corresponds to timestep 0). This is why the images you see in Figure 18 are all free from noise. I acknowledge that this might not be the best possible result, since not all of the generated images are valid digits. On the bright side, though, it also indicates that these images are not merely duplicates of samples from the original dataset.

We can also visualize the backward diffusion process using Codeblock 27 below. You can see in the resulting output in Figure 19 that we start from completely random noise, which gradually disappears as we move to the right.

# Codeblock 27
fig, axes = plt.subplots(ncols=10, figsize=(24, 8))

sample_no = 0
timestep_no = 0

for i in range(10):
    axes[i].imshow(denoised_images[timestep_no][sample_no].squeeze().detach().cpu().numpy(), cmap='gray')
    axes[i].get_xaxis().set_visible(False)
    axes[i].get_yaxis().set_visible(False)
    timestep_no += 1

plt.show()
Figure 19. What the image looks like at timestep 900, 800, 700 and so on until timestep 0 [3].

Ending

There are plenty of directions you can go from here. First, you may need to tweak the parameter configurations in Codeblock 2 if you want better results. Second, you can modify the U-Net model by adding attention layers on top of the stacks of convolution layers we used in the downsampling and upsampling stages (a rough sketch of such a layer follows this paragraph). This does not guarantee better results, especially for a simple dataset like this, but it is definitely worth trying. Third, you can use a more complex dataset if you want to challenge yourself.
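For the second point, here is a minimal sketch of a spatial self-attention block. The class name, the GroupNorm setting, and the number of heads are all arbitrary choices for illustration rather than parts of the U-Net we built earlier; you could insert something like SpatialSelfAttention(64) after a convolution stage that outputs 64 channels.

# A sketch of a self-attention layer that treats each spatial location as a token.
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, num_channels, num_heads=4):  # num_channels must be divisible by num_heads
        super().__init__()
        self.norm = nn.GroupNorm(1, num_channels)
        self.attention = nn.MultiheadAttention(num_channels, num_heads, batch_first=True)

    def forward(self, x):
        batch_size, num_channels, height, width = x.shape
        # Flatten the spatial dimensions: (B, C, H, W) -> (B, H*W, C).
        tokens = self.norm(x).flatten(2).permute(0, 2, 1)
        attended, _ = self.attention(tokens, tokens, tokens)
        # Reshape back to the image layout and add a residual connection.
        attended = attended.permute(0, 2, 1).reshape(batch_size, num_channels, height, width)
        return x + attended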

When it comes to practical applications, there are lots of things you can do with diffusion models. The simplest one might be data augmentation: with a diffusion model, we can generate new images from a specific data distribution. For example, suppose we are working on an image classification project where the number of images across classes is imbalanced. To address this, we can train a diffusion model on images from the minority class only and then ask it to generate as many new samples from that class as we need (a rough sketch of this idea follows below).
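As a rough illustration, the sketch below reuses the train() and inference() functions from this article. The names here are assumptions: mnist_dataset stands for whatever dataset object you created earlier (rename it to match your own code), and digit 3 plays the role of the minority class.

# Hypothetical augmentation workflow: train on a single under-represented class, then sample from it.
from torch.utils.data import DataLoader, Subset

minority_digit = 3  # pretend this is the minority class
minority_indices = [i for i, (_, label) in enumerate(mnist_dataset) if label == minority_digit]

# Rebind the global loader used inside train() so the model only sees the minority class.
loader = DataLoader(Subset(mnist_dataset, minority_indices),
                    batch_size=BATCH_SIZE, shuffle=True, drop_last=True)

losses = train()                      # train (or fine-tune) on the minority class only
synthetic_images = inference()[-1]    # 64 newly generated samples of that class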

And well, that’s pretty much everything about the theory and implementation of diffusion models. Thanks for reading, and I hope you learned something new today!

You can access the code used in this project through this link. Here are also the links to my previous articles about Autoencoder, Variational Autoencoder (VAE), Neural Style Transfer (NST), and Transformer.


References

[1] Jascha Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Arxiv. https://arxiv.org/pdf/1503.03585 [Accessed December 27, 2024].

[2] Jonathan Ho et al. Denoising Diffusion Probabilistic Models. Arxiv. https://arxiv.org/pdf/2006.11239 [Accessed December 27, 2024].

[3] Image created originally by author.

[4] Olaf Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. Arxiv. https://arxiv.org/pdf/1505.04597 [Accessed December 27, 2024].

[5] Yann LeCun et al. The MNIST Database of Handwritten Digits. https://yann.lecun.com/exdb/mnist/ [Accessed December 30, 2024] (Creative Commons Attribution-Share Alike 3.0 license).

[6] Ashish Vaswani et al. Attention Is All You Need. Arxiv. https://arxiv.org/pdf/1706.03762 [Accessed September 29, 2024].
