Deep Learning | Towards Data Science

Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech

Avishek Biswas — Sat, 12 Apr 2025 01:09:27 +0000

Recently, Sesame AI published a demo of their latest Speech-to-Speech model. A conversational AI agent who is really good at speaking, they provide relevant answers, they speak with expressions, and honestly, they are just very fun and interactive to play with.

Note that a technical paper is not out yet, but they do have a short blog post that provides a lot of information about the techniques they used and previous algorithms they built upon.

Thankfully, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or a CSM. It inputs both text and audio, and generates speech as audio. While they haven’t revealed their training data sources in the articles, we can still try to take a solid guess. The blog post heavily cites another CSM, 2024’s Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream training), and 2000 more hours of telephone conversations (The Fischer Dataset).

Sesame builds upon the Moshi Paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values — a waveform. For example, if you’re sampling audio at 24 kHz, you are capturing 24,000 float values every second.

There are 24000 values here to represent 1 second of speech! (Image generated by author)

Of course, it is quite resource-intensive to process 24000 float values for just one second of data, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.

We will take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of Audio/Speech modeling in Deep Learning today. We will end the article by learning about how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper as well. Mimi is a self-supervised audio encoder-decoder model that converts audio waveforms into discrete “latent” tokens first, and then reconstructs the original signal. Sesame only uses the encoder section of Mimi to tokenize the input audio tokens. Let’s learn how.

Mimi inputs the raw speech waveform at 24Khz, passes them through several strided convolution layers to downsample the signal, with a stride factor of 4, 5, 6, 8, and 2. This means that the first CNN block downsamples the audio by 4x, then 5x, then 6x, and so on. In the end, it downsamples by a factor of 1920, reducing it to just 12.5 frames per second.

The convolution blocks also project the original float values to an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. 1 second of audio is now represented as around 12 vectors of size 512. This way, Mimi reduces the sequence length from 24000 to just 12 and converts them into dense continuous vectors.

Before applying any quantization, the Mimi Encoder downsamples the input 24KHz audio by 1920 times, and embeds it into 512 dimensions. In other words, you get 12.5 frames per second with each frame as a 512-dimensional vector. (Image from author’s video)

What is Audio Quantization?

Given the continuous embeddings obtained after the convolution layer, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language learning transformers to train generative models.

Mimi uses a Residual Vector Quantizer or RVQ tokenizer to achieve this. We will talk about the residual part soon, but first, let’s look at what a simple vanilla Vector quantizer does.

Vector Quantization

The idea behind Vector Quantization is simple: you train a codebook , which is a collection of, say, 1000 random vector codes all of size 512 (same as your embedding dimension).

A Vanilla Vector Quantizer. A codebook of embeddings is trained. Given an input embedding, we map/quantize it to the nearest codebook entry. (Screenshot from author’s video)

Then, given the input vector, we will map it to the closest vector in our codebook — basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame embedding may be, we will represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic where I go much deeper with this.

More about Vector Quantization! (Video by author)

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high because we are mapping each vector to its cluster’s centroid. This “snap” is rarely perfect, so there is always an error between the original embedding and the nearest codebook.

The big idea of Residual Vector Quantization is that it doesn’t stop at having just one codebook. Instead, it tries to use multiple codebooks to represent the input vector.

First, you quantize the original vector using the first codebook.
Then, you subtract that centroid from your original vector. What you’re left with is the residual — the error that wasn’t captured in the first quantization.
Now take this residual, and quantize it again, using a second codebook full of brand new code vectors — again by snapping it to the nearest centroid.
Subtract that too, and you get a smaller residual. Quantize again with a third codebook… and you can keep doing this for as many codebooks as you want.

Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer to represent the previous codebook’s error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, let’s say, N codebooks, you get a collection of N discrete tokens from each stage of quantization to represent one audio frame.

The coolest thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. In the subsequent quantizers, they learn more and more fine-grained features.

If you’re familiar with PCA, you can think of the first codebook as containing the primary principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds more details.

Residual Vector Quantizers (RVQ) uses multiple codebooks to encode the input vector — one entry from each codebook. (Screenshot from author’s video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal to the discretized latent space, and the decoder reconstructs it back from the latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space.

Mimi also separately trains a single codebook (vanilla VQ) that only focuses on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer – it divides the quantization process into two independent parallel paths: one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi used knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic RVQ code and the WavLM-generated embedding.

Audio Decoder

Given a conversation containing text and audio, we first convert them into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then input into a transformer model as a time series. In the blog post, this model is referred to as the Autoregressive Backbone Transformer. Its task is to process this time series and output the “zeroth” codebook token.

A lighterweight transformer called the audio decoder then reconstructs the next codebook tokens conditioned on this zeroth code generated by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder only operates on the zeroth token and generates the other N-1 codes. These codes are generated by using N-1 distinct linear layers that output the probability of choosing each code from their corresponding codebooks.

You can imagine this process as predicting a text token from the vocabulary in a text-only LLM. Just that a text-based LLM has a single vocabulary, but the RVQ-tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each.

The Sesame Architecture (Illustration by the author)

Finally, after the codewords are all generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this audio back to a waveform. For this, we apply transposed convolutional layers to upscale the embedding back from 12.5 Hz back to KHz waveform audio. Basically, reversing the transforms we had applied originally during audio preprocessing.

In Summary

Check out the accompanying video on this article! (Video by author)

So, here is the overall summary of the Sesame model in some bullet points.

Sesame is built on a multimodal Conversation Speech Model or a CSM.
Text and audio are tokenized together to form a sequence of tokens and input into the backbone transformer that autoregressively processes the sequence.
While the text is processed like any other text-based LLM, the audio is processed directly from its waveform representation. They use the Mimi encoder to convert the waveform into latent codes using a split RVQ tokenizer.
The multimodal backbone transformers consume a sequence of tokens and predict the next zeroth codeword.
Another lightweight transformer called the Audio Decoder predicts the next codewords from the zeroth codeword.
The final audio frame representation is generated from combining all the generated codewords and upsampled back to the waveform representation.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Relevant papers:
Moshi: https://arxiv.org/abs/2410.00037
SoundStream: https://arxiv.org/abs/2107.03312
HuBert: https://arxiv.org/abs/2106.07447
Speech Tokenizer: https://arxiv.org/abs/2308.16692

The post Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech appeared first on Towards Data Science.

The Basis of Cognitive Complexity: Teaching CNNs to See Connections

Salvatore Raieli — Fri, 11 Apr 2025 05:44:46 +0000

Liberating education consists in acts of cognition, not transferrals of information.
Paulo freire

One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing?

Many authors suggest that artificial intelligence models do not possess the same capabilities as humans, especially when it comes to plasticity, flexibility, and adaptation.

One of the aspects that models do not capture are several causal relationships about the external world.

This article discusses these issues:

The parallelism between convolutional neural networks (CNNs) and the human visual cortex
Limitations of CNNs in understanding causal relations and learning abstract concepts
How to make CNNs learn simple causal relations

Is it the same? Is it different?

Convolutional networks (CNNs) [2] are multi-layered neural networks that take images as input and can be used for multiple tasks. One of the most fascinating aspects of CNNs is their inspiration from the human visual cortex [1]:

Hierarchical processing. The visual cortex processes images hierarchically, where early visual areas capture simple features (such as edges, lines, and colors) and deeper areas capture more complex features such as shapes, objects, and scenes. CNN, due to its layered structure, captures edges and textures in the early layers, while layers further down capture parts or whole objects.
Receptive fields. Neurons in the visual cortex respond to stimuli in a specific local region of the visual field (commonly called receptive fields). As we go deeper, the receptive fields of the neurons widen, allowing more spatial information to be integrated. Thanks to pooling steps, the same happens in CNNs.
Feature sharing. Although biological neurons are not identical, similar features are recognized across different parts of the visual field. In CNNs, the various filters scan the entire image, allowing patterns to be recognized regardless of location.
Spatial invariance. Humans can recognize objects even when they are moved, scaled, or rotated. CNNs also possess this property.

The relationship between components of the visual system and CNN. Image source: here

These features have made CNNs perform well in visual tasks to the point of superhuman performance:

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well-trained on the validation images to be better aware of the existence of relevant classes. […] Our result (4.94%) exceeds the reported human-level performance. —source [3]

Although CNNs perform better than humans in several tasks, there are still cases where they fail spectacularly. For example, in a 2024 study [4], AI models failed to generalize image classification. State-of-the-art models perform better than humans for objects on upright poses but fail when objects are on unusual poses.

The right label is on the top of the object, and the AI wrong predicted label is below. Image source: here

In conclusion, our results show that (1) humans are still much more robust than most networks at recognizing objects in unusual poses, (2) time is of the essence for such ability to emerge, and (3) even time-limited humans are dissimilar to deep neural networks. —source [4]

In the study [4], they note that humans need time to succeed in a task. Some tasks require not only visual recognition but also abstractive cognition, which requires time.

The generalization abilities that make humans capable come from understanding the laws that govern relations among objects. Humans recognize objects by extrapolating rules and chaining these rules to adapt to new situations. One of the simplest rules is the “same-different relation”: the ability to define whether two objects are the same or different. This ability develops rapidly during infancy and is also importantly associated with language development [5-7]. In addition, some animals such as ducks and chimpanzees also have it [8]. In contrast, learning same-different relations is very difficult for neural networks [9-10].

Example of a same-different task for a CNN. The network should return a label of 1 if the two objects are the same or a label of 0 if they are different. Image source: here

Convolutional networks show difficulty in learning this relationship. Likewise, they fail to learn other types of causal relationships that are simple for humans. Therefore, many researchers have concluded that CNNs lack the inductive bias necessary to be able to learn these relationships.

These negative results do not mean that neural networks are completely incapable of learning same-different relations. Much larger and longer trained models can learn this relation. For example, vision-transformer models pre-trained on ImageNet with contrastive learning can show this ability [12].

Can CNNs learn same-different relationships?

The fact that broad models can learn these kinds of relationships has rekindled interest in CNNs. The same-different relationship is considered among the basic logical operations that make up the foundations for higher-order cognition and reasoning. Showing that shallow CNNs can learn this concept would allow us to experiment with other relationships. Moreover, it will allow models to learn increasingly complex causal relationships. This is an important step in advancing the generalization capabilities of AI.

Previous work suggests that CNNs do not have the architectural inductive biases to be able to learn abstract visual relations. Other authors assume that the problem is in the training paradigm. In general, the classical gradient descent is used to learn a single task or a set of tasks. Given a task t or a set of tasks T, a loss function L is used to optimize the weights φ that should minimize the function L:

Image source from here

This can be viewed as simply the sum of the losses across different tasks (if we have more than one task). Instead, the Model-Agnostic Meta-Learning (MAML) algorithm [13] is designed to search for an optimal point in weight space for a set of related tasks. MAML seeks to find an initial set of weights θ that minimizes the loss function across tasks, facilitating rapid adaptation:

Image source from here

The difference may seem small, but conceptually, this approach is directed toward abstraction and generalization. If there are multiple tasks, traditional training tries to optimize weights for different tasks. MAML tries to identify a set of weights that is optimal for different tasks but at the same time equidistant in the weight space. This starting point θ allows the model to generalize more effectively across different tasks.

Meta-learning initial weights for generalization. Image source from here

Since we now have a method biased toward generalization and abstraction, we can test whether we can make CNNs learn the same-different relationship.

In this study [11], they compared shallow CNNs trained with classic gradient descent and meta-learning on a dataset designed for this report. The dataset consists of 10 different tasks that test for the same-different relationship.

The Same-Different dataset. Image source from here

The authors [11] compare CNNs of 2, 4, or 6 layers trained in a traditional way or with meta-learning, showing several interesting results:

The performance of traditional CNNs shows similar behavior to random guessing.
Meta-learning significantly improves performance, suggesting that the model can learn the same-different relationship. A 2-layer CNN performs little better than chance, but by increasing the depth of the network, performance improves to near-perfect accuracy.

Comparison between traditional training and meta-learning for CNNs. Image source from here

One of the most intriguing results of [11] is that the model can be trained in a leave-one-out way (use 9 tasks and leave one out) and show out-of-distribution generalization capabilities. Thus, the model has learned abstracting behavior that is hardly seen in such a small model (6 layers).

out-of-distribution for same-different classification. Image source from here

Conclusions

Although convolutional networks were inspired by how the human brain processes visual stimuli, they do not capture some of its basic capabilities. This is especially true when it comes to causal relations or abstract concepts. Some of these relationships can be learned from large models only with extensive training. This has led to the assumption that small CNNs cannot learn these relations due to a lack of architecture inductive bias. In recent years, efforts have been made to create new architectures that could have an advantage in learning relational reasoning. Yet most of these architectures fail to learn these kinds of relationships. Intriguingly, this can be overcome through the use of meta-learning.

The advantage of meta-learning is to incentivize more abstractive learning. Meta-learning pressure toward generalization, trying to optimize for all tasks at the same time. To do this, learning more abstract features is favored (low-level features, such as the angles of a particular shape, are not useful for generalization and are disfavored). Meta-learning allows a shallow CNN to learn abstract behavior that would otherwise require many more parameters and training.

The shallow CNNs and same-different relationship are a model for higher cognitive functions. Meta-learning and different forms of training could be useful to improve the reasoning capabilities of the models.

Another thing!

You can look for my other articles on Medium, and you can also connect or reach me on LinkedIn or in Bluesky. Check this repository, which contains weekly updated ML & AI news, or here for other tutorials and here for AI reviews. I am open to collaborations and projects, and you can reach me on LinkedIn.

Reference

Here is the list of the principal references I consulted to write this article, only the first name for an article is cited.

Lindsay, 2020, Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future, link
Li, 2020, A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects, link
He, 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, link
Ollikka, 2024, A comparison between humans and AI at recognizing objects in unusual poses, link
Premark, 1981, The codes of man and beasts, link
Blote, 1999, Young children’s organizational strategies on a same–different task: A microgenetic study and a training study, link
Lupker, 2015, Is there phonologically based priming in the same-different task? Evidence from Japanese-English bilinguals, link
Gentner, 2021, Learning same and different relations: cross-species comparisons, link
Kim, 2018, Not-so-clevr: learning same–different relations strains feedforward neural networks, link
Puebla, 2021, Can deep convolutional neural networks support relational reasoning in the same-different task? link
Gupta, 2025, Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation, link
Tartaglini, 2023, Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations, link
Finn, 2017, Model-agnostic meta-learning for fast adaptation of deep networks, link

The post The Basis of Cognitive Complexity: Teaching CNNs to See Connections appeared first on Towards Data Science.

Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o

Iason Solomos — Thu, 10 Apr 2025 05:14:56 +0000

Introduction

I’ve always been fascinated by debates—the strategic framing, the sharp retorts, and the carefully timed comebacks. Debates aren’t just entertaining; they’re structured battles of ideas, driven by logic and evidence. Recently, I started wondering: could we replicate that dynamic using AI agents—having them debate each other autonomously, complete with real-time fact-checking and moderation? The result was Deb8flow, an autonomous AI debating environment powered by LangGraph, OpenAI’s GPT-4o model, and the new integrated Web Search feature.

In Deb8flow, two agents—Pro and Con—square off on a given topic while a Moderator manages turn-taking. A dedicated Fact Checker reviews every claim in real time using GPT-4o’s new browsing capabilities, and a final Judge evaluates the arguments for quality and coherence. If an agent repeatedly makes factual errors, they’re automatically disqualified—ensuring the debate stays grounded in truth.

This article offers an in-depth look at the advanced architecture and dynamic workflows that power autonomous AI debates. I’ll walk you through how Deb8flow’s modular design leverages LangGraph’s state management and conditional routing, alongside GPT-4o’s capabilities.

Even if you’re new to AI agents or LangGraph (see resources [1] and [2] for primers), I’ll explain the key concepts clearly. And if you’d like to explore further, the full project is available on GitHub: iason-solomos/Deb8flow.

Ready to see how AI agents can debate autonomously in practice?

Let’s dive in.

High-Level Overview: Autonomous Debates with Multiple Agents

In Deb8flow, we orchestrate a formal debate between two AI agents – one arguing Pro and one Con – complete with a Moderator, a Fact Checker, and a final Judge. The debate unfolds autonomously, with each agent playing a role in a structured format.

At its core, Deb8flow is a LangGraph-powered agent system, built atop LangChain, using GPT-4o to power each role—Pro, Con, Judge, and beyond. We use GPT-4o’s preview model with browsing capabilities to enable real-time fact-checking. In essence, the Pro and Con agents debate; after each statement, a fact-checker agent uses GPT-4o’s web search to catch any hallucinations or inaccuracies in that statement in real time. The debate only continues once the statement is verified. The whole process is coordinated by a LangGraph-defined workflow that ensures proper turn-taking and conditional logic.

High-level debate flow graph. Each rectangle is an agent node (Pro/Con debaters, Fact Checker, Judge, etc.), and diamonds are control nodes (Moderator and a router after fact-checking). Solid arrows denote the normal progression, while dashed arrows indicate retries if a claim fails fact-check. The Judge node outputs the final verdict, then the workflow ends.
Image generated by the author with DALL-E

The debate workflow goes through these stages:

Topic Generation: A Topic Generator agent produces a nuanced, debatable topic for the session (e.g. “Should AI be used in classroom education?”).
Opening: The Pro Argument Agent makes an opening statement in favor of the topic, kicking off the debate.
Rebuttal: The Debate Moderator then gives the floor to the Con Argument agent, who rebuts the Pro’s opening statement.
Counter: The Moderator gives the floor back to the Pro agent, who counters the Con agent’s points.
Closing: The Moderator switches the floor to the Con agent one last time for a closing argument.
Judgment: Finally, the Judge agent reviews the full debate history and evaluates both sides based on argument quality, clarity, and persuasiveness. The most convincing side wins.

After every single speech, the Fact Checker agent steps in to verify the factual accuracy of that statement. If a debater’s claim doesn’t hold up (e.g. cites a wrong statistic or “hallucinates” a fact), the workflow triggers a retry: the speaker has to correct or modify their statement. (If either debater accumulates 3 fact-check failures, they are automatically disqualified for repeatedly spreading inaccuracies, and their opponent wins by default.) This mechanism keeps our AI debaters honest and grounded in reality!

Prerequisites and Setup

Before diving into the code, make sure you have the following in place:

Python 3.12+ installed.
An OpenAI API key with access to the GPT-4o model. You can create your own API key here: https://platform.openai.com/settings/organization/api-keys
Project Code: Clone the Deb8flow repository from GitHub (git clone https://github.com/iason-solomos/Deb8flow.git). The repo includes a requirements.txt for all required packages. Key dependencies include LangChain/LangGraph (for building the agent graph) and the OpenAI Python client.
Install Dependencies: In your project directory, run: pip install -r requirements.txt to install the necessary libraries.
Create a .env file in the project root to hold your OpenAI API credentials. It should be of the form: OPENAI_API_KEY_GPT4O = "sk-…"
You can also at any time check out the README file: https://github.com/iason-solomos/Deb8flow if you simply want to run the finished app.

Once dependencies are installed and the environment variable is set, you should be ready to run the app. The project structure is organized for clarity:

Deb8flow/
├── configurations/
│ ├── debate_constants.py
│ └── llm_config.py
├── nodes/
│ ├── base_component.py
│ ├── topic_generator_node.py
│ ├── pro_debater_node.py
│ ├── con_debater_node.py
│ ├── debate_moderator_node.py
│ ├── fact_checker_node.py
│ ├── fact_check_router_node.py
│ └── judge_node.py
├── prompts/
│ ├── topic_generator_prompts.py
│ ├── pro_debater_prompts.py
│ ├── con_debater_prompts.py
│ └── … (prompts for other agents)
├── tests/ (contains unit and whole workflow tests)
└── debate_workflow.py

A quick tour of this structure:

configurations/ holds constant definitions and LLM configuration classes.

nodes/ contains the implementation of each agent or functional node in the debate (each of these is a module defining one agent’s behavior).

prompts/ stores the prompt templates for the language model (so each agent knows how to prompt GPT-4o for its specific task).

debate_workflow.py ties everything together by defining the LangGraph workflow (the graph of nodes and transitions).

debate_state.py defines the shared data structure that the agents will be using on each run.

tests/ includes some basic tests and example runs to help you verify everything is working.

Under the Hood: State Management and Workflow Setup

To coordinate a complex multi-turn debate, we need a shared state and a well-defined flow. We’ll start by looking at how Deb8flow defines the debate state and constants, and then see how the LangGraph workflow is constructed.

Defining the Debate State Schema (`debate_state.py`)

Deb8flow uses a shared state (https://langchain-ai.github.io/langgraph/concepts/low_level/#state ) in the form of a Python TypedDict that all agents can read from and update. This state tracks the debate’s progress and context – things like the topic, the history of messages, whose turn it is, etc. By centralizing this information, each agent node can make decisions based on the current state of the debate.

Link: debate_state.py

from typing import TypedDict, List, Dict, Literal


DebateStage = Literal["opening", "rebuttal", "counter", "final_argument"]

class DebateMessage(TypedDict):
    speaker: str  # e.g. pro or con
    content: str  # The message each speaker produced
    validated: bool  # Whether the FactChecker ok’d this message
    stage: DebateStage # The stage of the debate when this message was produced

class DebateState(TypedDict):
    debate_topic: str
    positions: Dict[str, str]
    messages: List[DebateMessage]
    opening_statement_pro_agent: str
    stage: str  # "opening", "rebuttal", "counter", "final_argument"
    speaker: str  # "pro" or "con"
    times_pro_fact_checked: int # The number of times the pro agent has been fact-checked. If it reaches 3, the pro agent is disqualified.
    times_con_fact_checked: int # The number of times the con agent has been fact-checked. If it reaches 3, the con agent is disqualified.

Key fields that we need to have in the DebateState include:

debate_topic (str): The topic being debated.
messages (List[DebateMessage]): A list of all messages exchanged so far. Each message is a dictionary with fields for speaker (e.g. "pro" or "con" or "fact_checker"), the message content (text), a validated flag (whether it passed fact-check), and the stage of the debate when it was produced.
stage (str): The current debate stage (one of "opening", "rebuttal", "counter", "final_argument").
speaker (str): Whose turn it is currently ("pro" or "con").
times_pro_fact_checked / times_con_fact_checked (int): Counters for how many times each side has been caught with a false claim. (In our rules, if a debater fails fact-check 3 times, they could be disqualified or automatically lose.)
positions (Dict[str, str]): (Optional) A mapping of each side’s general stance (e.g., "pro": "In favor of the topic").

By structuring the debate’s state, agents find it easy to access the conversation history or check the current stage, and the control logic can update the state between turns. The state is essentially the memory of the debate.

Constants and Configuration

To avoid “magic strings” scattered in the code, we define some constants in debate_constants.py. For example, constants for stage names (STAGE_OPENING = "opening", etc.), speaker identifiers (SPEAKER_PRO = "pro", SPEAKER_CON = "con", etc.), and node names (NODE_PRO_DEBATER = "pro_debater_node", etc.). These make the code easier to maintain and read.

debate_constants.py:

# Stage names
STAGE_OPENING = "opening"
STAGE_REBUTTAL = "rebuttal"
STAGE_COUNTER = "counter"
STAGE_FINAL_ARGUMENT = "final_argument"
STAGE_END = "end"

# Speakers
SPEAKER_PRO = "pro"
SPEAKER_CON = "con"
SPEAKER_JUDGE = "judge"

# Node names
NODE_PRO_DEBATER = "pro_debater_node"
NODE_CON_DEBATER = "con_debater_node"
NODE_DEBATE_MODERATOR = "debate_moderator_node"
NODE_JUDGE = "judge_node"

We also set up LLM configuration in llm_config.py. Here, we define classes for OpenAI or Azure OpenAI configs and then create a dictionary llm_config_map mapping model names to their config. For instance, we map "gpt-4o" to an OpenAILLMConfig that holds the model name and API key. This way, whenever we need to initialize a GPT-4o agent, we can just do llm_config_map["gpt-4o"] to get the right config. All our main agents (debaters, topic generator, judge) use this same GPT-4o configuration.

import os
from dataclasses import dataclass
from typing import Union

@dataclass
class OpenAILLMConfig:
    """
    A data class to store configuration details for OpenAI models.

    Attributes:
        model_name (str): The name of the OpenAI model to use.
        openai_api_key (str): The API key for authenticating with the OpenAI service.
    """
    model_name: str
    openai_api_key: str


llm_config_map = {
    "gpt-4o": OpenAILLMConfig(
        model_name="gpt-4o",
        openai_api_key=os.getenv("OPENAI_API_KEY_GPT4O"),
    )
}

Building the LangGraph Workflow (`debate_workflow.py`)

With state and configs in place, we construct the debate workflow graph. LangGraph’s StateGraph is the backbone that connects all our agent nodes in the order they should execute. Here’s how we set it up:

class DebateWorkflow:

    def _initialize_workflow(self) -> StateGraph:
        workflow = StateGraph(DebateState)
        # Nodes
        workflow.add_node("generate_topic_node", GenerateTopicNode(llm_config_map["gpt-4o"]))
        workflow.add_node("pro_debater_node", ProDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("con_debater_node", ConDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("fact_check_node", FactCheckNode())
        workflow.add_node("fact_check_router_node", FactCheckRouterNode())
        workflow.add_node("debate_moderator_node", DebateModeratorNode())
        workflow.add_node("judge_node", JudgeNode(llm_config_map["gpt-4o"]))

        # Entry point
        workflow.set_entry_point("generate_topic_node")

        # Flow
        workflow.add_edge("generate_topic_node", "pro_debater_node")
        workflow.add_edge("pro_debater_node", "fact_check_node")
        workflow.add_edge("con_debater_node", "fact_check_node")
        workflow.add_edge("fact_check_node", "fact_check_router_node")
        workflow.add_edge("judge_node", END)
        return workflow



    async def run(self):
        workflow = self._initialize_workflow()
        graph = workflow.compile()
        # graph.get_graph().draw_mermaid_png(output_file_path="workflow_graph.png")
        initial_state = {
            "topic": "",
            "positions": {}
        }
        final_state = await graph.ainvoke(initial_state, config={"recursion_limit": 50})
        return final_state

Let’s break down what’s happening:

We initialize a new StateGraph with our DebateState type as the state schema.
We add each node (agent) to the graph with a name. For nodes that need an LLM, we pass in the GPT-4o config. For example, "pro_debater_node" is added as ProDebaterNode(llm_config_map["gpt-4o"]), meaning the Pro debater agent will use GPT-4o as its underlying model.
We set the entry point of the graph to "generate_topic_node". This means the first step of the workflow is to generate a debate topic.
Then we add directed edges to connect nodes. The edges above encode the primary sequence: topic -> pro’s turn -> fact-check -> (then a routing decision) -> … eventually -> judge -> END. We don’t connect the Moderator or Fact Check Router with static edges, since these nodes use dynamic commands to redirect the flow. The final edge connects the judge to an END marker to terminate the graph.

When the workflow runs, control will pass along these edges in order, but whenever we hit a router or moderator node, that node will output a command telling the graph which node to go to next (overriding the default edge). This is how we create conditional loops: the fact_check_router_node might send us back to a debater node for a retry, instead of following a straight line. LangGraph supports this by allowing nodes to return a special Command object with goto instructions.

In summary, at a high level we’ve defined an agentic workflow: a graph of autonomous agents where control can branch and loop based on the agents’ outputs. Now, let’s explore what each of these agent nodes actually does.

Agent Nodes Breakdown

Each stage or role in the debate is encapsulated in a node (agent). In LangGraph, nodes are often simple functions, but I wanted a more object-oriented approach for clarity and reusability. So in Deb8flow, every node is a class with a __call__ method. All the main agent classes inherit from a common BaseComponent for shared functionality. This design makes the system modular: we can easily swap out or extend agents by modifying their class definitions, and each agent class is responsible for its piece of the workflow.

Let’s go through the key agents one by one.

`BaseComponent` – A Reusable Agent Base Class

Most of our agent nodes (like the debaters and judge) share common needs: they use an LLM to generate output, they might need to retry on errors, and they should track token usage. The BaseComponent class (defined in nodes/base_component.py) provides these common features so we don’t repeat code.

class BaseComponent:
    """
    A foundational class for managing LLM-based workflows with token tracking.
    Can handle both Azure OpenAI (AzureChatOpenAI) and OpenAI (ChatOpenAI).
    """

    def __init__(
        self,
        llm_config: Optional[LLMConfig] = None,
        temperature: float = 0.0,
        max_retries: int = 5,
    ):
        """
        Initializes the BaseComponent with optional LLM configuration and temperature.

        Args:
            llm_config (Optional[LLMConfig]): Configuration for either Azure or OpenAI.
            temperature (float): Controls the randomness of LLM outputs. Defaults to 0.0.
            max_retries (int): How many times to retry on 429 errors.
        """
        logger = logging.getLogger(self.__class__.__name__)
        tracer = trace.get_tracer(__name__, tracer_provider=get_tracer_provider())

        self.logger = logger
        self.tracer = tracer
        self.llm: Optional[ChatOpenAI] = None
        self.output_parser: Optional[StrOutputParser] = None
        self.state: Optional[DebateState] = None
        self.prompt_template: Optional[ChatPromptTemplate] = None
        self.chain: Optional[RunnableSequence] = None
        self.documents: Optional[List] = None
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.max_retries = max_retries

        if llm_config is not None:
            self.llm = self._init_llm(llm_config, temperature)
            self.output_parser = StrOutputParser()

    def _init_llm(self, config: LLMConfig, temperature: float):
        """
        Initializes an LLM instance for either Azure OpenAI or standard OpenAI.
        """
        if isinstance(config, AzureOpenAILLMConfig):
            # If it's Azure, use the AzureChatOpenAI class
            return AzureChatOpenAI(
                deployment_name=config.deployment_name,
                azure_endpoint=config.azure_endpoint,
                openai_api_version=config.openai_api_version,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        elif isinstance(config, OpenAILLMConfig):
            # If it's standard OpenAI, use the ChatOpenAI class
            return ChatOpenAI(
                model_name=config.model_name,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        else:
            raise ValueError("Unsupported LLMConfig type.")

    def validate_initialization(self) -> None:
        """
        Ensures we have an LLM and an output parser.
        """
        if not self.llm:
            raise ValueError("LLM is not initialized. Ensure `llm_config` is provided.")
        if not self.output_parser:
            raise ValueError("Output parser is not initialized.")

    def execute_chain(self, inputs: Any) -> Any:
        """
        Executes the LLM chain, tracks token usage, and retries on 429 errors.
        """
        if not self.chain:
            raise ValueError("No chain is initialized for execution.")

        retry_wait = 1  # Initial wait time in seconds

        for attempt in range(self.max_retries):
            try:
                with get_openai_callback() as cb:
                    result = self.chain.invoke(inputs)
                    self.logger.info("Prompt Token usage: %s", cb.prompt_tokens)
                    self.logger.info("Completion Token usage: %s", cb.completion_tokens)
                    self.prompt_tokens = cb.prompt_tokens
                    self.completion_tokens = cb.completion_tokens

                return result

            except Exception as e:
                # If the error mentions 429, do exponential backoff and retry
                if "429" in str(e):
                    self.logger.warning(
                        f"Rate limit reached. Retrying in {retry_wait} seconds... "
                        f"(Attempt {attempt + 1}/{self.max_retries})"
                    )
                    time.sleep(retry_wait)
                    retry_wait *= 2
                else:
                    self.logger.error(f"Unexpected error: {str(e)}")
                    raise e

        raise Exception("API request failed after maximum number of retries")

    def create_chain(
        self, system_template: str, human_template: str
    ) -> RunnableSequence:
        """
        Creates a chain for unstructured outputs.
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm | self.output_parser
        return self.chain

    def create_structured_output_chain(
        self, system_template: str, human_template: str, output_model: Type[BaseModel]
    ) -> RunnableSequence:
        """
        Creates a chain that yields structured outputs (parsed into a Pydantic model).
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm.with_structured_output(output_model)
        return self.chain

    def build_return_with_tokens(self, node_specific_data: dict) -> dict:
        """
        Convenience method to add token usage info into the return values.
        """
        return {
            **node_specific_data,
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
        }

    def __call__(self, state: DebateState) -> None:
        """
        Updates the node's local copy of the state.
        """
        self.state = state
        for key, value in state.items():
            setattr(self, key, value)

Key features of BaseComponent:

It stores an LLM client (e.g. an OpenAI ChatOpenAI instance) initialized with a given model and API key, as well as an output parser.
It provides a method create_chain(system_template, human_template) which sets up a LangChain prompt chain (a RunnableSequence) combining a system prompt and a human prompt. This chain is what actually generates outputs when run.
It has an execute_chain(inputs) method that invokes the chain and includes logic to retry if the OpenAI API returns a rate-limit error (HTTP 429). This is done with exponential backoff up to a max_retries count.
It keeps track of token usage (prompt tokens and completion tokens) for logging or analysis.
The __call__ method of BaseComponent (which each subclass will call via super().__call__(state)) can perform any setup needed before the node’s main logic runs (like ensuring the LLM is initialized).

By building on BaseComponent, each agent class can focus on its unique logic (like what prompt to use and how to handle the state), while inheriting the heavy lifting of interacting with GPT-4o reliably.

Topic Generator Agent (`GenerateTopicNode`)

The Topic Generator (topic_generator_node.py) is the first agent in the graph. Its job is to come up with a debatable topic for the session. We give it a prompt that instructs it to output a nuanced topic that could reasonably have a pro and con side.

This agent inherits from BaseComponent and uses a prompt chain (system + human prompt) to generate one item of text – the debate topic. When called, it executes the chain (with no special input, just using the prompt) and gets back a topic_text. It then updates the state with:

debate_topic: the generated topic (stripped of any extra whitespace),
positions: a dictionary assigning the pro and con stances (by default we use "In favor of the topic" and "Against the topic"),
stage: set to "opening",
speaker: set to "pro" (so the Pro side will speak first).

In code, the return might look like:

return {
    "debate_topic": debate_topic,
    "positions": positions,
    "stage": "opening",
    "speaker": first_speaker  # "pro"
}

Here are the prompts for the topic generator:

SYSTEM_PROMPT = """\
You are a brainstorming AI that suggests debate topics.
You will provide a single, interesting or timely topic that can have two opposing views.
"""

HUMAN_PROMPT = """\
Please suggest one debate topic for two AI agents to discuss.
For example, it could be about technology, politics, philosophy, or any interesting domain.
Just provide the topic in a concise sentence.
"""

Then we pass these prompts in the constructor of the class itself.

class GenerateTopicNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        # Create the prompt chain.
        self.chain: RunnableSequence = self.create_chain(
            system_template=SYSTEM_PROMPT,
            human_template=HUMAN_PROMPT
        )

    def __call__(self, state: DebateState) -> Dict[str, str]:
        """
        Generates a debate topic and assigns positions to the two debaters.
        """
        super().__call__(state)

        topic_text = self.execute_chain({})

        # Store the topic and assign stances in the DebateState
        debate_topic = topic_text.strip()
        positions = {
            "pro": "In favor of the topic",
            "con": "Against the topic"
        }

        
        first_speaker = "pro"
        self.logger.info("Welcome to our debate panel! Today's debate topic is: %s", debate_topic)
        return {
            "debate_topic": debate_topic,
            "positions": positions,
            "stage": "opening",
            "speaker": first_speaker
        }

It’s a pattern we will repeat for all classes except for those not using LLMs and the fact checker.

Now we can implement the 2 stars of the show, the Pro and Con argument agents!

Debater Agents (Pro and Con)

Link: pro_debater_node.py

The two debater agents are very similar in structure, but each uses different prompt templates tailored to their role (pro vs con) and the stage of the debate.

The Pro debater, for example, has to handle an opening statement and a counter-argument (countering the Con’s rebuttal). We also need logic for retries in case a statement fails fact-check. In code, the ProDebater class sets up multiple prompt chains:

opening_chain and an opening_retry_chain (using slightly different human prompts – the retry prompt might instruct it to try again without repeating any factually dubious claims).
counter_chain and counter_retry_chain for the counter-argument stage.

class ProDebaterNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        self.opening_chain = self.create_chain(SYSTEM_PROMPT, OPENING_HUMAN_PROMPT)
        self.opening_retry_chain = self.create_chain(SYSTEM_PROMPT, OPENING_RETRY_HUMAN_PROMPT)
        self.counter_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_HUMAN_PROMPT)
        self.counter_retry_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_RETRY_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)

        debate_topic = state.get("debate_topic")
        messages = state.get("messages", [])
        stage = state.get("stage")
        speaker = state.get("speaker")

        # Check if retrying (last message was by pro and not validated)
        last_msg = messages[-1] if messages else None
        retrying = last_msg and last_msg["speaker"] == SPEAKER_PRO and not last_msg["validated"]

        if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
            chain = self.opening_retry_chain if retrying else self.opening_chain # select which chain we are triggering: the normal one or the fact-cehcked one
            result = chain.invoke({
                "debate_topic": debate_topic
            })
        elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
            opponent_msg = self._get_last_message_by(SPEAKER_CON, messages)
            debate_history = get_debate_history(messages)
            chain = self.counter_retry_chain if retrying else self.counter_chain
            result = chain.invoke({
                "debate_topic": debate_topic,
                "opponent_statement": opponent_msg,
                "debate_history": debate_history
            })
        else:
            raise ValueError(f"Unknown turn for ProDebater: stage={stage}, speaker={speaker}")
        new_message = create_debate_message(speaker=SPEAKER_PRO, content=result, stage=stage)
        self.logger.info("Speaker: %s, Stage: %s, Retry: %s\nMessage:\n%s", speaker, stage, retrying, result)
        return {
            "messages": messages + [new_message]
        }

    def _get_last_message_by(self, speaker_prefix, messages):
        for m in reversed(messages):
            if m.get("speaker") == speaker_prefix:
                return m["content"]
        return ""

When the ProDebater’s __call__ runs, it looks at the current stage and speaker in the state to decide what to do:

If it’s the opening stage and the speaker is “pro”, it uses the opening_chain to generate an opening argument. If the last message from Pro was marked invalid (not validated), it knows this is a retry, so it would use the opening_retry_chain instead.
If it’s the counter stage and speaker is “pro”, it generates a counter-argument to whatever the opponent (Con) just said. It will fetch the last message by the Con from the messages history, and feed that into the prompt (so that the Pro can directly counter it). Again, if the last Pro message was invalid, it would switch to the retry chain.

After generating its argument, the Debater agent creates a new message entry (with speaker="pro", the content text, validated=False initially, and the stage) and appends it to the state’s message list. That becomes the output of the node (LangGraph will merge this partial state update into the global state).

The Con Debater agent mirrors this logic for its stages:

It similarly appends its message to the state.

It has a rebuttal and closing argument (final argument) stage, each with a normal and a retry chain.

It checks if it’s the rebuttal stage (speaker “con”) or final argument stage (speaker “con”) and invokes the appropriate chain, possibly using the last Pro message for context when rebutting.

con_debater_node.py

By using class-based implementation, our debaters’ code is easier to maintain. We can clearly separate what the Pro does vs what the Con does, even if they share structure. Also, by encapsulating prompt chains inside the class, each debater can manage multiple possible outputs (regular vs retry) cleanly.

Prompt design: The actual prompts (in prompts/pro_debater_prompts.py and con_debater_prompts.py) guide the GPT-4o model to take on a persona (“You are a debater arguing for/against the topic…”) and produce the argument. They also instruct the model to keep statements factual and logical. If a fact check fails, the retry prompt may say something like: “Your previous statement had an unverified claim. Revise your argument to be factually correct while maintaining your position.” – encouraging the model to correct itself.

With this, our AI debaters can engage in a multi-turn duel, and even recover from factual missteps.

Fact Checker Agent (`FactCheckNode`)

After each debater speaks, the Fact Checker agent swoops in to verify their claims. This agent is implemented in fact_checker_node.py, and interestingly, it uses the GPT-4o model’s browsing ability rather than our own custom prompts. Essentially, we delegate the fact-checking to OpenAI’s GPT-4 with web search.

How does this work? The OpenAI Python client for GPT-4 (with browsing) allows us to send a user message and get a structured response. In FactCheckNode.__call__, we do something like:

completion = self.client.beta.chat.completions.parse(
            model="gpt-4o-search-preview",
            web_search_options={},
            messages=[{
                "role": "user",
                "content": (
                        f"Consider the following statement from a debate. "
                        f"If the statement contains numbers, or figures from studies, fact-check it online.\n\n"
                        f"Statement:\n\"{claim}\"\n\n"
                        f"Reply clearly whether any numbers or studies might be inaccurate or hallucinated, and why."
                        f"\n"
                        f"If the statement doesn't contain references to studies or numbers cited, don't go online to fact-check, and just consider it successfully fact-checked, with a 'yes' score.\n\n"
                )
            }],
            response_format=FactCheck
        )

If the result is “yes” (meaning the claim seems truthful or at least not factually wrong), the Fact Checker will mark the last message’s validated field as True in the state, and output {"validated": True} with no further changes. This signals that the debate can continue normally.

If the result is “no” (meaning it found the claim to be incorrect or dubious), the Fact Checker will append a new message to the state with speaker="fact_checker" describing the finding (or we could simply mark it, but providing a brief note like “(Fact Checker: The statistic cited could not be verified.)” can be useful). It will also set validated: False and increment a counter for whichever side made the claim. The output state from this node includes validated: False and an updated times_pro_fact_checked or times_con_fact_checked count.

We also use a Pydantic BaseModel to control the output of the LLM:

class FactCheck(BaseModel):
    """
    Pydantic model for the fact checking the claims made by debaters.

    Attributes:
        binary_score (str): 'yes' if the claim is verifiable and truthful, 'no' otherwise.
    """

    binary_score: str = Field(
        description="Indicates if the claim is verifiable and truthful. 'yes' or 'no'."
    )
    justification: str = Field(
        description="Explanation of the reasoning behind the score."
    )

Debate Moderator Agent (`DebateModeratorNode`)

The Debate Moderator is the conductor of the debate. Instead of producing lengthy text, this agent’s job is to manage turn-taking and stage progression. In the workflow, after a statement is validated by the Fact Checker, control passes to the Moderator node. The Moderator then issues a Command that updates the state for the next turn and directs the flow to the appropriate next agent.

The logic in DebateModeratorNode.__call__ (see nodes/debate_moderator_node.py) goes roughly like this:

if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
            return Command(
                update={"stage": STAGE_REBUTTAL, "speaker": SPEAKER_CON},
                goto=NODE_CON_DEBATER
            )
        elif stage == STAGE_REBUTTAL and speaker == SPEAKER_CON:
            return Command(
                update={"stage": STAGE_COUNTER, "speaker": SPEAKER_PRO},
                goto=NODE_PRO_DEBATER
            )
        elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
            return Command(
                update={"stage": STAGE_FINAL_ARGUMENT, "speaker": SPEAKER_CON},
                goto=NODE_CON_DEBATER
            )
        elif stage == STAGE_FINAL_ARGUMENT and speaker == SPEAKER_CON:
            return Command(
                update={},
                goto=NODE_JUDGE
            )

        raise ValueError(f"Unexpected stage/speaker combo: stage={stage}, speaker={speaker}")

Each conditional corresponds to a point in the debate where a turn just ended, and sets up the next turn. For example, after the opening (Pro just spoke), it sets stage to rebuttal, switches speaker to Con, and directs the workflow to the Con debater node. After the final_argument (Con’s closing), it directs to the Judge with no further update (the debate stage effectively ends).

Fact Check Router (`FactCheckRouterNode`)

This is another control node (like the Moderator) that introduces conditional logic. The Fact Check Router sits right after the Fact Checker agent in the flow. Its purpose is to branch the workflow depending on the fact-check result.

In nodes/fact_check_router_node.py, the logic is:

if pro_fact_checks >= 3 or con_fact_checks >= 3:
            disqualified = SPEAKER_PRO if pro_fact_checks >= 3 else SPEAKER_CON
            winner = SPEAKER_CON if disqualified == SPEAKER_PRO else SPEAKER_PRO

            verdict_msg = {
                "speaker": "moderator",
                "content": (
                    f"Debate ended early due to excessive factual inaccuracies.\n\n"
                    f"DISQUALIFIED: {disqualified.upper()} (exceeded fact check limit)\n"
                    f"WINNER: {winner.upper()}"
                ),
                "validated": True,
                "stage": "verdict"
            }
            return Command(
                update={"messages": messages + [verdict_msg]},
                goto=END
            )
        if last_message.get("validated"):
            return Command(goto=NODE_DEBATE_MODERATOR)
        elif speaker == SPEAKER_PRO:
            return Command(goto=NODE_PRO_DEBATER)
        elif speaker == SPEAKER_CON:
            return Command(goto=NODE_CON_DEBATER)
        raise ValueError("Unable to determine routing in FactCheckRouterNode.")

First, the Fact Check Router checks if either side’s fact-check count has reached 3. If so, it creates a Moderator-style message announcing an early end: the offending side is disqualified and the other side is the winner. It appends this verdict to the messages and returns a Command that jumps to END, effectively terminating the debate without going to the Judge (because we already know the outcome).

If we’re not ending the debate early, it then looks at the Fact Checker’s result for the last message (which is stored as validated on that message). If validated is True, we go to the debate moderator: Command(goto=debate_moderator_node).

Else if the statement fails fact-check, the workflow goes back to the debater to produce a revised statement (with the state counters updated to reflect the failure). This loop can happen multiple times if needed (up to the disqualification limit).

This dynamic control is the heart of Deb8flow’s “agentic” nature – the ability to adapt the path of execution based on the content of the agents’ outputs. It showcases LangGraph’s strength: combining control flow with state. We’re essentially encoding debate rules (like allowing retries for false claims, or ending the debate if someone cheats too often) directly into the workflow graph.

Judge Agent (`JudgeNode`)

Last but not least, the Judge agent delivers the final verdict based on rhetorical skill, clarity, structure, and overall persuasiveness. Its system prompt and human prompt make this explicit:

System Prompt: “You are an impartial debate judge AI. … Evaluate which debater presented their case more clearly, persuasively, and logically. You must focus on communication skills, structure of argument, rhetorical strength, and overall coherence.”
Human Prompt: “Here is the full debate transcript. Please analyze the performance of both debaters—PRO and CON. Evaluate rhetorical performance—clarity, structure, persuasion, and relevance—and decide who presented their case more effectively.”

When the Judge node runs, it receives the entire debate transcript (all validated messages) alongside the original topic. It then uses GPT-4o to examine how each side framed their arguments, handled counterpoints, and supported (or failed to support) claims with examples or logic. Crucially, the Judge is forbidden to evaluate which position is objectively correct (or who it thinks might be correct)—only who argued more persuasively.

Below is an example final verdict from a Deb8flow run on the topic:
“Should governments implement a universal basic income in response to increasing automation in the workforce?”

WINNER: PRO

REASON: The PRO debater presented a more compelling and rhetorically effective case for universal basic income. Their arguments were well-structured, beginning with a clear statement of the issue and the necessity of UBI in response to automation. They effectively addressed potential counterarguments by highlighting the unprecedented speed and scope of current technological changes, which distinguishes the current situation from past technological shifts. The PRO also provided empirical evidence from UBI pilot programs to counter the CON's claims about work disincentives and economic inefficiencies, reinforcing their argument with real-world examples.

In contrast, the CON debater, while presenting valid concerns about UBI, relied heavily on historical analogies and assumptions about workforce adaptability without adequately addressing the unique challenges posed by modern automation. Their arguments about the fiscal burden and potential inefficiencies of UBI were less supported by specific evidence compared to the PRO's rebuttals.

Overall, the PRO's arguments were more coherent, persuasive, and backed by empirical evidence, making their case more convincing to a neutral observer.

Langsmith Tracing

Throughout Deb8flow’s development, I relied on LangSmith (LangChain’s tracing and observability toolkit) to ensure the entire debate pipeline was behaving correctly. Because we have multiple agents passing control between themselves, it’s easy for unexpected loops or misrouted states to occur. LangSmith provides a convenient way to:

Visualize Execution Flow: You can see each agent’s prompt, the tokens consumed (so you can also track costs), and any intermediate states. This makes it much simpler to confirm that, say, the Con Debater is properly referencing the Pro Debater’s last message, or that the Fact Checker is accurately receiving the claim to verify.
Debug State Updates: If the Moderator or Fact Check Router is sending the flow to the wrong node, the trace will highlight that mismatch. You can trace which agent was invoked at each step and why, helping you spot stage or speaker misalignments early.
Track Prompt and Completion Tokens: With multiple GPT-4o calls, it’s useful to see how many tokens each stage is using, which LangSmith logs automatically if you enable tracing.

Integrating LangSmith is unexpectedly easy. You will just need to provide these 3 keys in your .env file: LANGCHAIN_API_KEY

LANGCHAIN_TRACING_V2

LANGCHAIN_PROJECT

Then you can open the LangSmith UI to see a structured trace of each run. This greatly reduces the guesswork involved in debugging multi-agent systems and is, in my experience, essential for more complex AI orchestration like ours. Example of a single run:

The trace in waterfall mode in Lansmith of one run, showing how the whole flow ran. Source: Generated by the author using Langsmith.

Reflections and Next Steps

Building Deb8flow was an eye-opening exercise in orchestrating autonomous agent workflows. We didn’t just chain a single model call – we created an entire debate simulation with AI agents, each with a specific role, and allowed them to interact according to a set of rules. LangGraph provided a clear framework to define how data and control flows between agents, making the complex sequence manageable in code. By using class-based agents and a shared state, we maintained modularity and clarity, which will pay off for any software engineering project in the long run.

An exciting aspect of this project was seeing emergent behavior. Even though each agent follows a script (a prompt), the unscripted combination – a debater trying to deceive, a fact-checker catching it, the debater rephrasing – felt surprisingly realistic! It’s a small step toward more Agentic Ai systems that can perform non-trivial multi-step tasks with oversight on each other.

There’s plenty of ideas for improvement:

User Interaction: Currently it’s fully autonomous, but one could add a mode where a human provides the topic or even takes the role of one side against an AI opponent.
We can switch the order in which the Debaters talk.
We can change the prompts, and thus to a good degree the behavior of the agents, and experiment with different prompts.
Make the debaters also perform web search before producing their statements, thus providing them with the latest information.

The broader implication of Deb8flow is how it showcases a pattern for composable AI agents. By defining clear boundaries and interactions (just like microservices in software), we can have complex AI-driven processes that remain interpretable and controllable. Each agent is like a cog in a machine, and LangGraph is the gear system making them work in unison.

I found this project energizing, and I hope it inspires you to explore multi-agent workflows. Whether it’s debating, collaborating on writing, or solving problems from different expert angles, the combination of GPT, tools, and structured agentic workflows opens up a new world of possibilities for AI development. Happy hacking!

References

[1] D. Bouchard, “From Basics to Advanced: Exploring LangGraph,” Medium, Nov. 22, 2023. [Online]. Available: https://medium.com/data-science/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787. [Accessed: Apr. 1, 2025].

[2] A. W. T. Ng, “Building a Research Agent that Can Write to Google Docs: Part 1,” Towards Data Science, Jan. 11, 2024. [Online]. Available: https://towardsdatascience.com/building-a-research-agent-that-can-write-to-google-docs-part-1-4b49ea05a292/. [Accessed: Apr. 1, 2025].

The post Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o appeared first on Towards Data Science.

The Case for Centralized AI Model Inference Serving

Chaim Rand — Wed, 02 Apr 2025 01:52:26 +0000

As AI models continue to increase in scope and accuracy, even tasks once dominated by traditional algorithms are gradually being replaced by Deep Learning models. Algorithmic pipelines — workflows that take an input, process it through a series of algorithms, and produce an output — increasingly rely on one or more AI-based components. These AI models often have significantly different resource requirements than their classical counterparts, such as higher memory usage, reliance on specialized hardware accelerators, and increased computational demands.

In this post, we address a common challenge: efficiently processing large-scale inputs through algorithmic pipelines that include deep learning models. A typical solution is to run multiple independent jobs, each responsible for processing a single input. This setup is often managed with job orchestration frameworks (e.g., Kubernetes). However, when deep learning models are involved, this approach can become inefficient as loading and executing the same model in each individual process can lead to resource contention and scaling limitations. As AI models become increasingly prevalent in algorithmic pipelines, it is crucial that we revisit the design of such solutions.

In this post we evaluate the benefits of centralized Inference serving, where a dedicated inference server handles prediction requests from multiple parallel jobs. We define a toy experiment in which we run an image-processing pipeline based on a ResNet-152 image classifier on 1,000 individual images. We compare the runtime performance and resource utilization of the following two implementations:

Decentralized inference — each job loads and runs the model independently.
Centralized inference — all jobs send inference requests to a dedicated inference server.

To keep the experiment focused, we make several simplifying assumptions:

Instead of using a full-fledged job orchestrator (like Kubernetes), we implement parallel process execution using Python’s multiprocessing module.
While real-world workloads often span multiple nodes, we run everything on a single node.
Real-world workloads typically include multiple algorithmic components. We limit our experiment to a single component — a ResNet-152 classifier running on a single input image.
In a real-world use case, each job would process a unique input image. To simplify our experiment setup, each job will process the same kitty.jpg image.
We will use a minimal deployment of a TorchServe inference server, relying mostly on its default settings. Similar results are expected with alternative inference server solutions such as NVIDIA Triton Inference Server or LitServe.

The code is shared for demonstrative purposes only. Please do not interpret our choice of TorchServe — or any other component of our demonstration — as an endorsement of its use.

Toy Experiment

We conduct our experiments on an Amazon EC2 c5.2xlarge instance, with 8 vCPUs and 16 GiB of memory, running a PyTorch Deep Learning AMI (DLAMI). We activate the PyTorch environment using the following command:

source /opt/pytorch/bin/activate

Step 1: Creating a TorchScript Model Checkpoint

We begin by creating a ResNet-152 model checkpoint. Using TorchScript, we serialize both the model definition and its weights into a single file:

import torch
from torchvision.models import resnet152, ResNet152_Weights

model = resnet152(weights=ResNet152_Weights.DEFAULT)
model = torch.jit.script(model)
model.save("resnet-152.pt")

Step 2: Model Inference Function

Our inference function performs the following steps:

Load the ResNet-152 model.
Load an input image.
Preprocess the image to match the input format expected by the model, following the implementation defined here.
Run inference to classify the image.
Post-process the model output to return the top five label predictions, following the implementation defined here.

We define a constant MAX_THREADS hyperparameter that we use to restrict the number of threads used for model inference in each process. This is to prevent resource contention between the multiple jobs.

import os, time, psutil
import multiprocessing as mp
import torch
import torch.nn.functional as F
import torchvision.transforms as transforms
from PIL import Image


def predict(image_id):
    # Limit each process to 1 thread
    MAX_THREADS = 1
    os.environ["OMP_NUM_THREADS"] = str(MAX_THREADS)
    os.environ["MKL_NUM_THREADS"] = str(MAX_THREADS)
    torch.set_num_threads(MAX_THREADS)

    # load the model
    model = torch.jit.load('resnet-152.pt').eval()

    # Define image preprocessing steps
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                             std=[0.229, 0.224, 0.225])
    ])

    # load the image
    image = Image.open('kitten.jpg').convert("RGB")
    
    # preproc
    image = transform(image).unsqueeze(0)

    # perform inference
    with torch.no_grad():
        output = model(image)

    # postproc
    probabilities = F.softmax(output[0], dim=0)
    probs, classes = torch.topk(probabilities, 5, dim=0)
    probs = probs.tolist()
    classes = classes.tolist()

    return dict(zip(classes, probs))

Step 3: Running Parallel Inference Jobs

We define a function that spawns parallel processes, each processing a single image input. This function:

Accepts the total number of images to process and the maximum number of concurrent jobs.
Dynamically launches new processes when slots become available.
Monitors CPU and memory usage throughout execution.

def process_image(image_id):
    print(f"Processing image {image_id} (PID: {os.getpid()})")
    predict(image_id)

def spawn_jobs(total_images, max_concurrent):
    start_time = time.time()
    max_mem_utilization = 0.
    max_utilization = 0.

    processes = []
    index = 0
    while index < total_images or processes:

        while len(processes) < max_concurrent and index < total_images:
            # Start a new process
            p = mp.Process(target=process_image, args=(index,))
            index += 1
            p.start()
            processes.append(p)

        # sample memory utilization
        mem_usage = psutil.virtual_memory().percent
        max_mem_utilization = max(max_mem_utilization, mem_usage)
        cpu_util = psutil.cpu_percent(interval=0.1)
        max_utilization = max(max_utilization, cpu_util)

        # Remove completed processes from list
        processes = [p for p in processes if p.is_alive()]

    total_time = time.time() - start_time
    print(f"\nTotal Processing Time: {total_time:.2f} seconds")
    print(f"Max CPU Utilization: {max_utilization:.2f}%")
    print(f"Max Memory Utilization: {max_mem_utilization:.2f}%")

spawn_jobs(total_images=1000, max_concurrent=32)

Estimating the Maximum Number of Processes

While the optimal number of maximum concurrent processes is best determined empirically, we can estimate an upper bound based on the 16 GiB of system memory and the size of the resnet-152.pt file, 231 MB.

The table below summarizes the runtime results for several configurations:

Decentralized Inference Results (by Author)

Although memory becomes fully saturated at 50 concurrent processes, we observe that maximum throughput is achieved at 8 concurrent jobs — one per vCPU. This indicates that beyond this point, resource contention outweighs any potential gains from additional parallelism.

The Inefficiencies of Independent Model Execution

Running parallel jobs that each load and execute the model independently introduces significant inefficiencies and waste:

Each process needs to allocate the appropriate memory resources for storing its own copy of the AI model.
AI models are compute-intensive. Executing them in many processes in parallel can lead to resource contention and reduced throughput.
Loading the model checkpoint file and initializing the model in each process adds overhead and can further increase latency. In the case of our toy experiment, model initialization makes up for roughly 30%(!!) of the overall inference processing time.

A more efficient alternative is to centralize inference execution using a dedicated model inference server. This approach would eliminate redundant model loading and reduce overall system resource utilization.

In the next section we will set up an AI model inference server and assess its impact on resource utilization and runtime performance.

Note: We could have modified our multiprocessing-based approach to share a single model across processes (e.g., using torch.multiprocessing or another solution based on shared memory). However, the inference server demonstration better aligns with real-world production environments, where jobs often run in isolated containers.

TorchServe Setup

The TorchServe setup described in this section loosely follows the resnet tutorial. Please refer to the official TorchServe documentation for more in-depth guidelines.

Installation

The PyTorch environment of our DLAMI comes preinstalled with TorchServe executables. If you are running in a different environment run the following installation command:

pip install torchserve torch-model-archiver

Creating a Model Archive

The TorchServe Model Archiver packages the model and its associated files into a “.mar” file archive, the format required for deployment on TorchServe. We create a TorchServe model archive file based on our model checkpoint file and using the default image_classifier handler:

mkdir model_store
torch-model-archiver \
    --model-name resnet-152 \
    --serialized-file resnet-152.pt \
    --handler image_classifier \
    --version 1.0 \
    --export-path model_store

TorchServe Configuration

We create a TorchServe config.properties file to define how TorchServe should operate:

model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
        "marName": "resnet-152.mar"\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=1

# Job queue size (default is 100)
job_queue_size=100

After completing these steps, our working directory should look like this:

├── config.properties
֫├── kitten.jpg
├── model_store
│   ├── resnet-152.mar
├── multi_job.py

Starting TorchServe

In a separate shell we start our TorchServe inference server:

source /opt/pytorch/bin/activate
torchserve \
    --start \
    --disable-token-auth \
    --enable-model-api \
    --ts-config config.properties

Inference Request Implementation

We define an alternative prediction function that calls our inference service:

import requests

def predict_client(image_id):
    with open('kitten.jpg', 'rb') as f:
        image = f.read()
    response = requests.post(
        "http://127.0.0.1:8080/predictions/resnet-152",
        data=image,
        headers={'Content-Type': 'application/octet-stream'}
    )

    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error from inference server: {response.text}")

Scaling Up the Number of Concurrent Jobs

Now that inference requests are being processed by a central server, we can scale up parallel processing. Unlike the earlier approach where each process loaded and executed its own model, we have sufficient CPU resources to allow for many more concurrent processes. Here we choose 100 processes in accordance with the default job_queue_size capacity of the inference server:

spawn_jobs(total_images=1000, max_concurrent=100)

Results

The performance results are captured in the table below. Keep in mind that the comparative results can vary greatly based on the details of the AI model and the runtime environment.

Inference Server Results (by Author)

By using a centralized inference server, not only have we have increased overall throughput by more than 2X, but we have freed significant CPU resources for other computation tasks.

Next Steps

Now that we have effectively demonstrated the benefits of a centralized inference serving solution, we can explore several ways to enhance and optimize the setup. Recall that our experiment was intentionally simplified to focus on demonstrating the utility of inference serving. In real-world deployments, additional enhancements may be required to tailor the solution to your specific needs.

Custom Inference Handlers: While we used TorchServe’s built-in image_classifier handler, defining a custom handler provides much greater control over the details of the inference implementation.
Advanced Inference Server Configuration: Inference server solutions will typically include many features for tuning the service behavior according to the workload requirements. In the next sections we will explore some of the features supported by TorchServe.
Expanding the Pipeline: Real world models will typically include more algorithm blocks and more sophisticated AI models than we used in our experiment.
Multi-Node Deployment: While we ran our experiments on a single compute instance, production setups will typically include multiple nodes.
Alternative Inference Servers: While TorchServe is a popular choice and relatively easy to set up, there are many alternative inference server solutions that may provide additional benefits and may better suit your needs. Importantly, it was recently announced that TorchServe would no longer be actively maintained. See the documentation for details.
Alternative Orchestration Frameworks: In our experiment we use Python multiprocessing. Real-world workloads will typically use more advanced orchestration solutions.
Utilizing Inference Accelerators: While we executed our model on a CPU, using an AI accelerator (e.g., an NVIDIA GPU, a Google Cloud TPU, or an AWS Inferentia) can drastically improve throughput.
Model Optimization: Optimizing your AI models can greatly increase efficiency and throughput.
Auto-Scaling for Inference Load: In some use cases inference traffic will fluctuate, requiring an inference server solution that can scale its capacity accordingly.

In the next sections we explore two simple ways to enhance our TorchServe-based inference server implementation. We leave the discussion on other enhancements to future posts.

Batch Inference with TorchServe

Many model inference service solutions support the option of grouping inference requests into batches. This usually results in increased throughput, especially when the model is running on a GPU.

We extend our TorchServe config.properties file to support batch inference with a batch size of up to 8 samples. Please see the official documentation for details on batch inference with TorchServe.

model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
        "marName": "resnet-152.mar",\
        "batchSize": 8,\
        "maxBatchDelay": 100,\
        "responseTimeout": 200\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=1

# Job queue size (default is 100)
job_queue_size=100

Results

We append the results in the table below:

Batch Inference Server Results (by Author)

Enabling batched inference increases the throughput by an additional 26.5%.

Multi-Worker Inference with TorchServe

Many model inference service solutions will support creating multiple inference workers for each AI model. This enables fine-tuning the number of inference workers based on expected load. Some solutions support auto-scaling of the number of inference workers.

We extend our own TorchServe setup by increasing the default_workers_per_model setting that controls the number of inference workers assigned to our image classification model.

Importantly, we must limit the number of threads allocated to each worker to prevent resource contention. This is controlled by the number_of_netty_threads setting and by the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables. Here we have set the number of threads to equal the number of vCPUs (8) divided by the number of workers.

model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
        "marName": "resnet-152.mar"\
        "batchSize": 8,\
        "maxBatchDelay": 100,\
        "responseTimeout": 200\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=2 

# Job queue size (default is 100)
job_queue_size=100

# Number of threads per worker
number_of_netty_threads=4

The modified TorchServe startup sequence appears below:

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
torchserve \
    --start \
    --disable-token-auth \
    --enable-model-api \
    --ts-config config.properties

Results

In the table below we append the results of running with 2, 4, and 8 inference workers:

Multi-Worker Inference Server Results (by Author)

By configuring TorchServe to use multiple inference workers, we are able to increase the throughput by an additional 36%. This amounts to a 3.75X improvement over the baseline experiment.

Summary

This experiment highlights the potential impact of inference server deployment on multi-job deep learning workloads. Our findings suggest that using an inference server can improve system resource utilization, enable higher concurrency, and significantly increase overall throughput. Keep in mind that the precise benefits will greatly depend on the details of the workload and the runtime environment.

Designing the inference serving architecture is just one part of optimizing AI model execution. Please see some of our many posts covering a wide range AI model optimization techniques.

The post The Case for Centralized AI Model Inference Serving appeared first on Towards Data Science.

A Simple Implementation of the Attention Mechanism from Scratch

Marcello Politi — Tue, 01 Apr 2025 01:05:51 +0000

Introduction

The Attention Mechanism is often associated with the transformer architecture, but it was already used in RNNs. In Machine Translation or MT (e.g., English-Italian) tasks, when you want to predict the next Italian word, you need your model to focus, or pay attention, on the most important English words that are useful to make a good translation.

I will not go into details of RNNs, but attention helped these models to mitigate the vanishing gradient problem and to capture more long-range dependencies among words.

At a certain point, we understood that the only important thing was the attention mechanism, and the entire RNN architecture was overkill. Hence, Attention is All You Need!

Self-Attention in Transformers

Classical attention indicates where words in the output sequence should focus attention in relation to the words in input sequence. This is important in sequence-to-sequence tasks like MT.

The self-attention is a specific type of attention. It operates between any two elements in the same sequence. It provides information on how “correlated” the words are in the same sentence.

For a given token (or word) in a sequence, self-attention generates a list of attention weights corresponding to all other tokens in the sequence. This process is applied to each token in the sentence, obtaining a matrix of attention weights (as in the picture).

This is the general idea, in practice things are a bit more complicated because we want to add many learnable parameters to our neural network, let’s see how.

K, V, Q representations

Our model input is a sentence like “my name is Marcello Politi”. With the process of tokenization, a sentence is converted into a list of numbers like [2, 6, 8, 3, 1].

Before feeding the sentence into the transformer we need to create a dense representation for each token.

How to create this representation? We multiply each token by a matrix. The matrix is learned during training.

Let’s add some complexity now.

For each token, we create 3 vectors instead of one, we call these vectors: key, value and query. (We see later how we create these 3 vectors).

Conceptually these 3 tokens have a particular meaning:

The vector key represents the core information captured by the token
The vector value captures the full information of a token
The vector query, it’s a question about the token relevance for the current task.

So the idea is that we focus on a particular token i , and we want to ask what is the importance of the other tokens in the sentence regarding the token i we are taking into consideration.

This means that we take the vector q_i (we ask a question regarding i) for token i, and we do some mathematical operations with all the other tokens k_j (j!=i). This is like wondering at first glance what are the other tokens in the sequence that look really important to understand the meaning of token i.

What is this magical mathematical operation?

We need to multiply (dot-product) the query vector by the key vectors and divide by a scaling factor. We do this for each k_j token.

In this way, we obtain a score for each pair (q_i, k_j). We make this list become a probability distribution by applying a softmax operation on it. Great now we have obtained the attention weights!

With the attention weights, we know what is the importance of each token k_j to for undestandin the token i. So now we multiply the value vector v_j associated with each token per its weight and we sum the vectors. In this way we obtain the final context-aware vector of token_i.

If we are computing the contextual dense vector of token_1 we calculate:

z1 = a11*v1 + a12*v2 + … + a15*v5

Where a1j are the computer attention weights, and v_j are the value vectors.

Done! Almost…

I didn’t cover how we obtained the vectors k, v and q of each token. We need to define some matrices w_k, w_v and w_q so that when we multiply:

token * w_k -> k
token * w_q -> q
token * w_v -> v

These 3 matrices are set at random and are learned during training, this is why we have many parameters in modern models such as LLMs.

Multi-head Self-Attention in Transformers (MHSA)

Are we sure that the previous self-attention mechanism is able to capture all important relationships among tokens (words) and create dense vectors of those tokens that really make sense?

It could actually not work always perfectly. What if to mitigate the error we re-run the entire thing 2 times with new w_q, w_k and w_v matrices and somehow merge the 2 dense vectors obtained? In this way maybe one self-attention managed to capture some relationship and the other managed to capture some other relationship.

Well, this is what exactly happens in MHSA. The case we just discussed contains two heads because it has two sets of w_q, w_k and w_v matrices. We can have even more heads: 4, 8, 16 etc.

The only complicated thing is that all these heads are managed in parallel, we process the all in the same computation using tensors.

The way we merge the dense vectors of each head is simple, we concatenate them (hence the dimension of each vector shall be smaller so that when concat them we obtain the original dimension we wanted), and we pass the obtained vector through another w_o learnable matrix.

Hands-on

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">Python">import torch

Suppose you have a sentence. After tokenization, each token (word for simplicity) corresponds to an index (number):

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">tokenized_sentence = torch.tensor([
    2, #my
    6, #name
    8, #is
    3, #marcello
    1  #politi
])
tokenized_sentence

Before feeding the sentence into the transofrmer we need to create a dense representation for each token.

How to create these representation? We multiply each token per a matrix. This matrix is learned during training.

Let’s build this embedding matrix.

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">torch.manual_seed(0) # set a fixed seed for reproducibility
embed = torch.nn.Embedding(10, 16)

If we multiply our tokenized sentence with the embeddings, we obtain a dense representation of dimension 16 for each token

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">sentence_embed = embed(tokenized_sentence).detach()
sentence_embed

In order to use the attention mechanism we need to create 3 new We define 3 matrixes w_q, w_k and w_v. When we multiply one input token time the w_q we obtain the vector q. Same with w_k and w_v.

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">d = sentence_embed.shape[1] # let's base our matrix on a shape (16,16)

w_key = torch.rand(d,d)
w_query = torch.rand(d,d)
w_value = torch.rand(d,d)

Compute attention weights

Let’s now compute the attention weights for only the first input token of the sentence.

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">token1_embed = sentence_embed[0]

#compute the tre vector associated to token1 vector : q,k,v
key_1 = w_key.matmul(token1_embed)
query_1 = w_query.matmul(token1_embed)
value_1 = w_value.matmul(token1_embed)

print("key vector for token1: \n", key_1)   
print("query vector for token1: \n", query_1)
print("value vector for token1: \n", value_1)

We need to multiply the query vector associated to token1 (query_1) with all the keys of the other vectors.

So now we need to compute all the keys (key_2, key_2, key_4, key_5). But wait, we can compute all of these in one time by multiplying the sentence_embed times the w_k matrix.

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">keys = sentence_embed.matmul(w_key.T)
keys[0] #contains the key vector of the first token and so on

Let’s do the same thing with the values

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">values = sentence_embed.matmul(w_value.T)
values[0] #contains the value vector of the first token and so on

Let’s compute the first part of the attions formula.

import torch.nn.functional as F

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d"># the following are the attention weights of the first tokens to all the others
a1 = F.softmax(query_1.matmul(keys.T)/d**0.5, dim = 0)
a1

With the attention weights we know what is the importance of each token. So now we multiply the value vector associated to each token per its weight.

To obtain the final context aware vector of token_1.

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">z1 = a1.matmul(values)
z1

In the same way we could compute the context aware dense vectors of all the other tokens. Now we are always using the same matrices w_k, w_q, w_v. We say that we use one head.

But we can have multiple triplets of matrices, so multi-head. That’s why it is called multi-head attention.

The dense vectors of an input tokens, given in oputut from each head are at then end concatenated and linearly transformed to get the final dense vector.

Implementing MultiheadSelf-Attention

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0) # fixed seed for reproducibility

Same steps as before…

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d"># Tokenized sentence (same as yours)
tokenized_sentence = torch.tensor([2, 6, 8, 3, 1])  # [my, name, is, marcello, politi]

# Embedding layer: vocab size = 10, embedding dim = 16
embed = nn.Embedding(10, 16)
sentence_embed = embed(tokenized_sentence).detach()  # Shape: [5, 16] (seq_len, embed_dim)

We’ll define a multi-head attention mechanism with h heads (let’s say 4 heads for this example). Each head will have its own w_q, w_k, and w_v matrices, and the output of each head will be concatenated and passed through a final linear layer.

Since the output of the head will be concatenated, and we want a final dimension of d, the dimension of each head needs to be d/h. Additionally each concatenated vector will go though a linear transformation, so we need another matrix w_ouptut as you can see in the formula.

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">d = sentence_embed.shape[1]  # embed dimension 16
h = 4  # Number of heads
d_k = d // h  # Dimension per head (16 / 4 = 4)

Since we have 4 heads, we want 4 copies for each matrix. Instead of copies, we add a dimension, which is the same thing, but we only do one operation. (Imagine stacking matrices on top of each other, its the same thing).

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d"># Define weight matrices for each head
w_query = torch.rand(h, d, d_k)  # Shape: [4, 16, 4] (one d x d_k matrix per head)
w_key = torch.rand(h, d, d_k)    # Shape: [4, 16, 4]
w_value = torch.rand(h, d, d_k)  # Shape: [4, 16, 4]
w_output = torch.rand(d, d)  # Final linear layer: [16, 16]

I’m using for simplicity torch’s einsum. If you’re not familiar with it check out my blog post.

The einsum operation torch.einsum('sd,hde->hse', sentence_embed, w_query) in PyTorch uses letters to define how to multiply and rearrange numbers. Here’s what each part means:

Input Tensors:
- sentence_embed with the notation 'sd':
  - s represents the number of words (sequence length), which is 5.
  - d represents the number of numbers per word (embedding size), which is 16.
  - The shape of this tensor is [5, 16].
- w_query with the notation 'hde':
  - h represents the number of heads, which is 4.
  - d represents the embedding size, which again is 16.
  - e represents the new number size per head (d_k), which is 4.
  - The shape of this tensor is [4, 16, 4].
Output Tensor:
- The output has the notation 'hse':
  - h represents 4 heads.
  - s represents 5 words.
  - e represents 4 numbers per head.
  - The shape of the output tensor is [4, 5, 4].

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d"># Compute Q, K, V for all tokens and all heads
# sentence_embed: [5, 16] -> Q: [4, 5, 4] (h, seq_len, d_k)
queries = torch.einsum('sd,hde->hse', sentence_embed, w_query)  # h heads, seq_len tokens, d dim
keys = torch.einsum('sd,hde->hse', sentence_embed, w_key)       # h heads, seq_len tokens, d dim
values = torch.einsum('sd,hde->hse', sentence_embed, w_value)   # h heads, seq_len tokens, d dim

This einsum equation performs a dot product between the queries (hse) and the transposed keys (hek) to obtain scores of shape [h, seq_len, seq_len], where:

h -> Number of heads.
s and k -> Sequence length (number of tokens).
e -> Dimension of each head (d_k).

The division by (d_k ** 0.5) scales the scores to stabilize gradients. Softmax is then applied to obtain attention weights:

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d"># Compute attention scores
scores = torch.einsum('hse,hek->hsk', queries, keys.transpose(-2, -1)) / (d_k ** 0.5)  # [4, 5, 5]
attention_weights = F.softmax(scores, dim=-1)  # [4, 5, 5]

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d"># Apply attention weights
head_outputs = torch.einsum('hij,hjk->hik', attention_weights, values)  # [4, 5, 4]
head_outputs.shape

Now we concatenate all the heads of token 1

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d"># Concatenate heads
concat_heads = head_outputs.permute(1, 0, 2).reshape(sentence_embed.shape[0], -1)  # [5, 16]
concat_heads.shape

Let’s finally multiply per the last w_output matrix as in the formula above

 span:before #888; Syntax Highlighting Keywords (e.g., def, return, import) .token.keyword #c678dd; font-weight: bold; Functions .token.function #61afef; Strings .token.string #98c379; Comments .token.comment #7d">multihead_output = concat_heads.matmul(w_output)  # [5, 16] @ [16, 16] -> [5, 16]
print("Multi-head attention output for token1:\n", multihead_output[0])

Final Thoughts

In this blog post I’ve implemented a simple version of the attention mechanism. This is not how it is really implemented in modern frameworks, but my scope is to provide some insights to allow anyone an understanding of how this works. In future articles I’ll go through the entire implementation of a transformer architecture.

Follow me on TDS if you like this article!

Linkedin | X (Twitter) | Website

Unless otherwise noted, images are by the author

The post A Simple Implementation of the Attention Mechanism from Scratch appeared first on Towards Data Science.

Understanding the Tech Stack Behind Generative AI

Sarah Schürch — Tue, 01 Apr 2025 00:35:03 +0000

Understanding the Tech Stack Behind Generative AI

When ChatGPT reached the one million user mark within five days and took off faster than any other technology in history, the world began to pay attention to artificial intelligence and AI applications.

And so it continued apace. Since then, many different terms have been buzzing around — from ChatGPT and Nvidia H100 chips to Ollama, LangChain, and Explainable AI. What is actually meant for what?

That’s exactly what you’ll find in this article: A structured overview of the technology ecosystem around generative AI and LLMs.

Let’s dive in!

Table of Contents
1 What makes generative AI work – at its core
2 Scaling AI: Infrastructure and Compute Power
3 The Social Layer of AI: Explainability, Fairness and Governance
4 Emerging Abilities: When AI Starts to Interact and Act
Final Thoughts
Where Can You Continue Learning?

1 What makes generative AI work – at its core

New terms and tools in the field of artificial intelligence seem to emerge almost daily. At the core of it all are the foundational models, frameworks and the infrastructure required to run generative AI in the first place.

Foundation Models

Do you know the Swiss Army Knife? Foundation models are like such a multifunctional knife – you can perform many different tasks with just one tool.

Foundation models are large AI models that have been pre-trained on huge amounts of data (text, code, images, etc.). What is special about these models is that they can not only solve a single task but can also be used flexibly for many different applications. They can write texts, correct code, generate images or even compose music. And they are the basis for many generative AI applications.

The following three aspects are key to understanding foundation models:

Pre-trained
These models were trained on huge data sets. This means that the model has ‘read’ a huge amount of text or other data. This phase is very costly and time-consuming.
Multitask-capable
These foundation models can solve many tasks. If we look at GPT-4o, you can use it to solve everyday questions about knowledge questions, text improvements and code generation.
Transferable
Through fine-tuning or Retrieval Augmented Generation (RAG), we can adapt such Foundation Models to specific domains or specialise them for specific application areas. I have written about RAG and fine-tuning in detail in How to Make Your LLM More Accurate with RAG & Fine-Tuning. But the core of it is that you have two options to make your LLM more accurate: With RAG, the model remains the same, but you improve the input by providing the model with additional sources. For example, the model can access past support tickets or legal texts during a query – but the model parameters and weightings remain unchanged. With fine-tuning, you retrain the pre-trained model with additional sources – the model saves this knowledge permanently.

To get a feel for the amount of data we are talking about, let’s look at FineWeb. FineWeb is a massive dataset developed by Hugging Face to support the pre-training phase of LLMs. The dataset was created from 96 common-crawl snapshots and comprises 15 trillion tokens – which takes up about 44 terabytes of storage space.

Most foundation models are based on the Transformer architecture. In this article, I won’t go into this in more detail as it’s about the high-level components around AI. The most important thing to understand is that these models can look at the entire context of a sentence at the same time, for example – and not just read word by word from left to right. The foundational paper introducing this architecture was Attention is All You Need (2017).

All major players in the AI field have released foundation models — each with different strengths, use cases, and licensing conditions (open-source or closed-source).

GPT-4 from OpenAI, Claude from Anthropic and Gemini from Google, for example, are powerful but closed models. This means that neither the model weights nor the training data are accessible to the public.

There are also high-performing open-source models from Meta, such as LLaMA 2 and LLaMA 3, as well as from Mistral and DeepSeek.

A great resource for comparing these models is the LLM Arena on Hugging Face. It provides an overview of various language models, ranks them and allows for direct comparisons of their performance.

Screenshot taken by the author: We can see a comparison of different llm models in the LLM Arena.

Multimodal models

If we look at the GPT-3 model, it can only process pure text. Multimodal models now go one step further: They can process and generate not only text, but also images, audio and video. In other words, they can process and generate several types of data at the same time.

What does this mean in concrete terms?

Multimodal models process different types of input (e.g. an image and a question about it) and combine this information to provide more intelligent answers. For example, with the Gemini 1.5 version you can upload a photo with different ingredients and ask the question which ingredients you see on this plate.

How does this work technically?

Multimodal models understand not only speech but also visual or auditory information. Multimodal models are also usually based on transformer architecture like pure text models. However, an important difference is that not only words are processed as ‘tokens’ but also images as so-called patches. These are small image sections that are converted into vectors and can then be processed by the model.

Let’s have a look at some examples:

GPT-4-Vision
This model from OpenAI can process text and images. It recognises content on images and combines it with speech.
Gemini 1.5
Google’s model can process text, images, audio and video. It is particularly strong at retaining context across modalities.
Claude 3
Anthropic’s model can process text and images and is very good at visual reasoning. It is good at recognising diagrams, graphics and handwriting.

Other examples are Flamingo from DeepMind, Kosmos-2 from Microsoft or Grok (xAI) from Elon Musk’s xAI, which is integrated into Twitter.

GPU & Compute Providers

When generative AI models are trained, this requires enormous computing capacity. Especially for pre-training but also for inference – the subsequent application of the model to new inputs.

Imagine a musician practising for months to prepare for a concert – that’s what pre-training is like. During pre-training, a model such as GPT-4, Claude 3, LLaMA 3 or DeepSeek-VL learns from trillions of tokens that come from texts, code, images and other sources. These data volumes are processed with GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units). This is necessary because this hardware enables parallel computing (compared to CPUs). Many companies rent computing power in the cloud (e.g. via AWS, Google Cloud, Azure) instead of operating their own servers.

When a pre-trained model is adapted to specific tasks with fine-tuning, this in turn, requires a lot of computing power. This is one of the major differences when the model is customised with RAG. One way to make fine-tuning more resource-efficient is low-rank adaptation (LoRA). Here, small parts of the model are specifically retrained instead of the entire model being trained with new data.

If we stay with the music example, the inference is the moment when the actual live concert takes place, which has to be played over and over again. This example also makes it clear that this also requires resources. Inference is the process of applying an AI model to a new input (e.g. you ask a question to ChatGPT) to generate an answer or a prediction.

Some examples:

Specialised hardware components that are optimised for parallel computing are used for this. For example, NVIDIA’s A100 and H100 GPUs are standard in many data centres. AMD Instinct MI300X, for example, are also catching up as a high-performance alternative. Google TPUs are also used for certain workloads – especially in the Google ecosystem.

ML Frameworks & Libraries

Just like in programming languages or web development, there are frameworks for AI tasks. For example, they provide ready-made functions for building neural networks without the need to program everything from scratch. Or they make training more efficient by parallelising calculations with the framework and making efficient use of GPUs.

The most important ML frameworks for generative AI:

PyTorch was developed by Meta and is open source. It is very flexible and popular in research & open source.
TensorFlow was developed by Google and is very powerful for large AI models. It supports distributed training – explanation and is often used in cloud environments.
Keras is a part of TensorFlow and is mainly used for beginners and prototype development.
JAX is also from Google and was specially developed for high-performance AI calculations. It is often used for advanced research and Google DeepMind projects. For example, it is used for the latest Google AI models such as Gemini and Flamingo.

PyTorch and TensorFlow can easily be combined with other tools such as Hugging Face Transformers or ONNX Runtime.

AI Application Frameworks

These frameworks enable us to integrate the Foundation Models into specific applications. They simplify access to the Foundation Models, the management of prompts and the efficient administration of AI-supported workflows.

Three tools, as examples:

LangChain enables the orchestration of LLMs for applications such as chatbots, document processing and automated analyses. It supports access to APIs, databases and external storage. And it can be connected to vector databases – which I explain in the next section – to perform contextual queries.

Let’s look at an example: A company wants to build an internal AI assistant that searches through documents. With LangChain, it can now connect GPT-4 to the internal database and the user can search company documents using natural language.
LlamaIndex was specifically designed to make large amounts of unstructured data efficiently accessible to LLMs and is therefore important for Retrieval Augmented Generation (RAG). Since LLMs only have a limited knowledge base based on the training data, it allows RAG to retrieve additional information before generating an answer. And this is where LlamaIndex comes into play: it can be used to convert unstructured data, e.g. from PDFs, websites or databases, into searchable indices.

Let’s take a look at a concrete example:

A lawyer needs a legal AI assistant to search laws. LlamaIndex organises thousands of legal texts and can therefore provide precise answers quickly.
Ollama makes it possible to run large language models on your own laptop or server without having to rely on the cloud. No API access is required as the models run directly on the device.

For example, you can run a model such as Mistral, LLaMA 3 or DeepSeek locally on your device.

Databases & Vector Stores

In traditional data processing, relational databases (SQL databases) store structured data in tables, while NoSQL databases such as MongoDB or Cassandra are used to store unstructured or semi-structured data.

With LLMs, however, we now also need a way to store and search semantic information.

This requires vector databases: A foundation model does not process input as text, but converts it into numerical vectors – so-called embeddings. Vector databases make it possible to perform fast similarity and memory management for embeddings and thus provide relevant contextual information.

How does this work, for example, with Retrieval Augmented Generation?

Each text (e.g. a paragraph from a PDF) is translated into a vector.
You pass a query to the model as a prompt. For example, you ask a question. This question is now also translated into a vector.
The database now calculates which vectors are closest to the input vector.
These top results are made available to the LLM before it answers. And the model then uses this information additionally for the answer.

Examples of this are Pinecone, FAISS, Weaviate, Milvus, and Qdrant.

Programming Languages

Generative AI development also needs a programming language.

Of course, Python is probably the first choice for almost all AI applications. Python has established itself as the main language for AI & ML and is one of the most popular and widely used languages. It is flexible and offers a large AI ecosystem with all the previously mentioned frameworks such as TensorFlow, PyTorch, LangChain or LlamaIndex.

Why isn’t Python used for everything?

Python is not very fast. But thanks to CUDA backends, TensorFlow or PyTorch are still very performant. However, if performance is really very important, Rust, C++ or Go are more likely to be used.

Another language that must be mentioned is Rust: This language is used when it comes to fast, secure and memory-efficient AI infrastructures. For example, for efficient databases for vector searches or high-performance network communication. It is primarily used in the infrastructure and deployment area.

Julia is a language that is close to Python, but much faster – this makes it perfect for numerical calculations and tensor operations.

TypeScript or JavaScript are not directly relevant for AI applications but are often used in the front end of LLM applications (e.g., React or Next.js).

Own visualization — Illustrations from unDraw.co

2 Scaling AI: Infrastructure and Compute Power

Apart from the core components, we also need ways to scale and train the models.

Containers & Orchestration

Not only traditional applications, but also AI applications need to be provided and scaled. I wrote about containerisation in detail in this article Why Data Scientists Should Care about Containers – and Stand Out with This Knowledge. But at its core, the point is that with containers, we can run an AI model (or any other application) on any server and it works the same. This allows us to provide consistent, portable and scalable AI workloads.

Docker is the standard for containerisation. Generative AI is no different. We can use it to develop AI applications as isolated, repeatable units. Docker is used to deploy LLMs in the cloud or on edge devices. Edge means that the AI does not run in the cloud, but locally on your device. The Docker images contain everything you need: Python, ML frameworks such as PyTorch, CUDA for GPUs and AI APIs.

Let’s take a look at an example: A developer trains a model locally with PyTorch and saves it as a Docker container. This allows it to be easily deployed to AWS or Google Cloud.

Kubernetes is there to manage and scale container workloads. It can manage GPUs as resources. This makes it possible to run multiple models efficiently on a cluster – and to scale automatically when demand is high.

Kubeflow is less well-known outside of the AI world. It allows ML models to be orchestrated as a workflow from data processing to deployment. It is specifically designed for machine learning in production environments and supports automatic model training & hyperparameter training.

Chip manufacturers & AI hardware

The immense computing power that is required must be produced. This is done by chip manufacturers. Powerful hardware reduces training times and improves model inference.

There are now also some models that have been trained with fewer parameters or fewer resources for the same performance. When DeepSeek was published at the end of February, it was somewhat questioned how many resources are actually necessary. It is becoming increasingly clear that huge models and extremely expensive hardware are not always necessary.

Probably the best-known chip manufacturer in the field of AI is Nvidia, one of the most valuable companies. With its specialised A100 and H100 GPUs, the company has become the de facto standard for training and inferencing large AI models. In addition to Nvidia, however, there are other important players such as AMD with its Instinct MI300X series, Google, Amazon and Cerebras.

API Providers for Foundation Models

The Foundation Models are pre-trained models. We use APIs so that we can access them as quickly as possible without having to host them ourselves. API providers offer quick access to the models, such as OpenAI API, Hugging Face Inference Endpoints or Google Gemini API. To do this, you send a text via an API and receive the response back. However, APIs such as the OpenAI API are subject to a fee.

The best-known provider is OpenAI, whose API provides access to GPT-3.5, GPT-4, DALL-E for image generation and Whisper for speech-to-text. Anthropic also offers a powerful alternative with Claude 2 and 3. Google provides access to multimodal models such as Gemini 1.5 via the Gemini API.

Hugging Face is a central hub for open source models: the inference endpoints allow us to directly address Mistral 7B, Mixtral or Meta models, for example.

Another exciting provider is Cohere, which provides Command R+, a model specifically for Retrieval Augmented Generation (RAG) – including powerful embedding APIs.

Serverless AI architectures

Serverless computing does not mean that there is no server but that you do not need your own server. You only define what is to be executed – not how or where. The cloud environment then automatically starts an instance, executes the code and shuts the instance down again. The AWS Lambda functions, for example, are well-known here.

Something similar is also available specifically for AI. Serverless AI reduces the administrative effort and scales automatically. This is ideal, for example, for AI tasks that are used irregularly.

Let’s take a look at an example: A chatbot on a website that answers questions from customers doesn’t have to run all the time. However, when a visitor comes to the website and asks a question, it must have resources. It is, therefore, only called up when needed.

Serverless AI can save costs and reduce complexity. However, it is not useful for continuous, latency-critical tasks.

Examples: AWS Bedrock, Azure OpenAI Service, Google Cloud Vertex AI

With great power and capability comes responsibility. The more we integrate AI into our everyday applications, the more important it becomes to engage with the principles of Responsible AI.

So…Generative AI raises many questions:

Does the model explain how it arrives at its answers?
-> Question about Transparency
Are certain groups favoured?
-> Question about Fairness
How is it ensured that the model is not misused?
-> Question about Security
Who is liable for errors?
-> Question about Accountability
Who controls how and where AI is used?
-> Question about Governance
Which available data from the web (e.g. images from
artists) may be used?
-> Question about Copyright / data ethics

While we have comprehensive regulations for many areas of the physical world — such as noise control, light pollution, vehicles, buildings, and alcohol sales — similar regulatory efforts in the IT sector are still rare and often avoided.

I’m not making a generalisation or a value judgment about whether this is good or bad. Less regulation can accelerate innovation – new technologies reach the market faster. At the same time, there is a risk that important aspects such as ethical responsibility, bias detection or energy consumption by large models will receive too little attention.

With the AI Act, the EU is focusing more on a regulated approach that is intended to create clear framework conditions – but this, in turn, can reduce the speed of innovation. The USA tends to pursue a market-driven, liberal approach with voluntary guidelines. This promotes rapid development but often leaves ethical and social issues in the background.

Let’s take a look at three concepts:

Explainability

Many large LLMs such as GPT-4 or Claude 3 are considered so-called black boxes: they provide impressive answers, but we do not know exactly how they arrive at these results. The more we entrust them with – especially in sensitive areas such as education, medicine or justice – the more important it becomes to understand their decision-making processes.

Tools such as LIME, SHAP or Attention Maps are ways of minimising these problems. They analyse model decisions and present them visually. In addition, model cards (standardised documentation) help to make the capabilities, training data, limitations and potential risks of a model transparent.

Fairness

If a model has been trained with data that contains biases or biased representations, it will also inherit these biases and distortions. This can lead to certain population groups being systematically disadvantaged or stereotyped. There are methods for recognising bias and clear standards for how training data should be selected and tested.

Governance

Finally, the question of governance arises: Who actually determines how AI may be used? Who checks whether a model is being operated responsibly?

4 Emerging Abilities: When AI Starts to Interact and Act

This is about the new capabilities that go beyond the classic prompt-response model. AI is becoming more active, more dynamic and more autonomous.

Let’s take a look at a concrete example:

A classic LLM like GPT-3 follows the typical process: For example, you ask a question like ‘Please show me how to create a button with rounded corners using HTML & CSS’. The model then provides you with the appropriate code, including a brief explanation. The model returns a pure text output without the model actively executing or thinking anything further.

Screenshot taken by the author: The answer from ChatGPT if we ask for creating buttons with rounded corners.

AI agents go much further. They not only analyse the prompt but also develop plans independently, access external tools or APIs and can complete tasks in several steps.

A simple example:

Instead of just writing the template for an email, an agent can monitor a data source and independently send an email as soon as a certain event occurs. For example, an email could go out when a sales target has been exceeded.

AI agents

AI agents are an application logic based on the Foundation Models. They orchestrate decisions and execute steps independently. Agents such as AutoGPT carry out multi-step tasks independently. They think in loops and try to improve or achieve a goal step by step.

Some examples:

Your AI agent analyzes new market reports daily, summarizes them, stores them in a database, and notifies the user in case of deviations.
An agent initiates a job application process: It scans submitted profiles and matches them with job offers.
In an e-commerce shop, the agent monitors inventory levels and customer demand. If a product is running low, it automatically reorders it – including price comparisons between suppliers.

What typically makes up an AI agent?

An AI agent consists of several specialized components, making it possible to autonomously plan, execute, and learn tasks:

Large Language Model
The LLM is the core or thinking engine. Typical models include GPT-4, Claude 3, Gemini 1.5, or Mistral 7B.
Planning unit
The planner transforms a higher-level goal into a concrete plan or sequence of steps. Often based on methods like Chain-of-Thought or ReAct.
Tool access
This component enables the agent to use external tools. For example, using a browser for extended search, a Python environment for code execution or enabling access to APIs and databases.
Memory
This component stores information about previous interactions, intermediate results, or contextual knowledge. This is necessary so that the agent can act consistently across multiple steps.
Executor
This component executes the planned steps in the correct order, monitors progress, and replans in case of errors.

There are also tools like Make or n8n (low-code / no-code automation platforms), which also let you implement “agent-like” logic. They execute workflows with conditions, triggers, and actions. For example, an automated reply should be formulated when a new email arrives in the inbox. And there are a lot of templates for such use cases.

Screenshot taken by the author: Templates on n8n as an example for low-code or no-code platforms.

Reinforcement Learning

With reinforcement learning, the models are made more “human-friendly.” In this training method, the model learns through reward. This is especially important for tasks where there is no clear “right” or “wrong,” but rather gradual quality.

An example of this is when you use ChatGPT, receive two different responses and are asked to rate which one you prefer.

The reward can come either from human feedback (Reinforcement Learning from Human Feedback – RLHF) or from another model (Reinforcement Learning from AI Feedback – RLVR). In RLHF, a human rates several responses from a model, allowing the LLM to learn what “good” responses look like and better align with human expectations. In RLVR, the model doesn’t just receive binary feedback (e.g., good vs. bad) but differentiated, context-dependent rewards (e.g., a variable reward scale from -1 to +3). RLVR is especially useful where there are many possible “good” responses, but some match the user’s intent much better.

On my Substack, I regularly write summaries about the published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you’re interested, take a look or subscribe.

Final Thoughts

It would probably be possible to write an entire book about Generative Ai right now – not just a single article. Artificial intelligence has been researched and applied for many years. But we are currently in a moment where an explosion of tools, applications, and frameworks is happening – AI, and especially generative AI, has truly arrived in our everyday lives. Let’s see where this takes us and end with a quote from Alan Kay:

The best way to predict the future is to invent it.

Where Can You Continue Learning?

The post Understanding the Tech Stack Behind Generative AI appeared first on Towards Data Science.

The Art of Hybrid Architectures

Eric Chung — Sat, 29 Mar 2025 03:38:17 +0000

In my previous article, I discussed how morphological feature extractors mimic the way biological experts visually assess images.

This time, I want to go a step further and explore a new question:
Can different architectures complement each other to build an AI that “sees” like an expert?

Introduction: Rethinking Model Architecture Design

While building a high accuracy visual recognition model, I ran into a key challenge:

How do we get AI to not just “see” an image, but actually understand the features that matter?

Traditional CNNs excel at capturing local details like fur texture or ear shape, but they often miss the bigger picture. Transformers, on the other hand, are great at modeling global relationships, how different regions of an image interact, but they can easily overlook fine-grained cues.

This insight led me to explore combining the strengths of both architectures to create a model that not only captures fine details but also comprehends the bigger picture.

While developing PawMatchAI, a 124-breed dog classification system, I went through three major architectural phases:

1. Early Stage: EfficientNetV2-M + Multi-Head Attention

I started with EfficientNetV2-M and added a multi-head attention module.

I experimented with 4, 8, and 16 heads—eventually settling on 8, which gave the best results.

This setup reached an F1 score of 78%, but it felt more like a technical combination than a cohesive design.

2. Refinement: Focal Loss + Advanced Data Augmentation

After closely analyzing the dataset, I noticed a class imbalance, some breeds appeared far more frequently than others, skewing the model’s predictions.

To address this, I introduced Focal Loss, along with RandAug and mixup, to make the data distribution more balanced and diverse.
This pushed the F1 score up to 82.3%.

3. Breakthrough: Switching to ConvNextV2-Base + Training Optimization

Next, I replaced the backbone with ConvNextV2-Base, and optimized the training using OneCycleLR and a progressive unfreezing strategy.
The F1 score climbed to 87.89%.

But during real-world testing, the model still struggled with visually similar breeds, indicating room for improvement in generalization.

4. Final Step: Building a Truly Hybrid Architecture

After reviewing the first three phases, I realized the core issue: stacking technologies isn’t the same as getting them to work together.

What I needed was true collaboration between the CNN, the Transformer, and the morphological feature extractor, each playing to its strengths. So I restructured the entire pipeline.

ConvNextV2 was in charge of extracting detailed local features.
The morphological module acted like a domain expert, highlighting features critical for breed identification.

Finally, the multi-head attention brought it all together by modeling global relationships.

This time, they weren’t just independent modules, they were a team.
CNNs identified the details, the morphology module amplified the meaningful ones, and the attention mechanism tied everything into a coherent global view.

Key Result: The F1 score rose to 88.70%, but more importantly, this gain came from the model learning to understand morphology, not just memorize textures or colors.

It started recognizing subtle structural features—just like a real expert would—making better generalizations across visually similar breeds.

If you’re interested, I’ve written more about morphological feature extractors here.

These extractors mimic how biological experts assess shape and structure, enhancing critical visual cues like ear shape and body proportions.

They’re a vital part of this hybrid design, filling the gaps traditional models tend to overlook.

In this article, I’ll walk through:

The strengths and limitations of CNNs vs. Transformers—and how they can complement each other
Why I ultimately chose ConvNextV2 over EfficientNetV2
The technical details of multi-head attention and how I decided the number of heads
How all these elements came together in a unified hybrid architecture
And finally, how heatmaps reveal that the AI is learning to “see” key features, just like a human expert

1. The Strengths and Limitations of CNNs and Transformers

In the previous section, I discussed how CNNs and Transformers can effectively complement each other. Now, let’s take a closer look at what sets each architecture apart, their individual strengths, limitations, and how their differences make them work so well together.

1.1 The Strength of CNNs: Great with Details, Limited in Scope

CNNs are like meticulous artists, they can draw fine lines beautifully, but often miss the bigger composition.

Strong at Local Feature Extraction
CNNs are excellent at capturing edges, textures, and shapes—ideal for distinguishing fine-grained features like ear shapes, nose proportions, and fur patterns across dog breeds.

Computational Efficiency
With parameter sharing, CNNs process high-resolution images more efficiently, making them well-suited for large-scale visual tasks.

Translation Invariance
Even when a dog’s pose varies, CNNs can still reliably identify its breed.

That said, CNNs have two key limitations:

Limited Receptive Field:
CNNs expand their field of view layer by layer, but early-stage neurons only “see” small patches of pixels. As a result, it’s difficult for them to connect features that are spatially far apart.

For instance: When identifying a German Shepherd, the CNN might spot upright ears and a sloped back separately, but struggle to associate them as defining characteristics of the breed.

Lack of Global Feature Integration:
CNNs excel at local stacking of features, but they’re less adept at combining information from distant regions.

Example: To distinguish a Siberian Husky from an Alaskan Malamute, it’s not just about one feature, it’s about the combination of ear shape, facial proportions, tail posture, and body size. CNNs often struggle to consider these elements holistically.

1.2 The Strength of Transformers: Global Awareness, But Less Precise

Transformers are like master strategists with a bird’s-eye view, they quickly spot patterns, but aren’t great at filling in the fine details.

Capturing Global Context
Thanks to their self-attention mechanism, Transformers can directly link any two features in an image, no matter how far apart they are.

Dynamic Attention Weighting
Unlike CNNs’ fixed kernels, Transformers dynamically allocate focus based on context.

Example: When identifying a Poodle, the model may prioritize fur texture; when it sees a Bulldog, it might focus more on facial structure.

But Transformers also have two major drawbacks:

High Computational Cost:
Self-attention has a time complexity of O(n²). As image resolution increases, so does the cost—making training more intensive.

Weak at Capturing Fine Details:
Transformers lack CNNs’ “built-in intuition” that nearby pixels are usually related.

Example: On their own, Transformers might miss subtle differences in fur texture or eye shape, details that are crucial for distinguishing visually similar breeds.

1.3 Why a Hybrid Architecture Is Necessary

Let’s take a real world case:

How do you distinguish a Golden Retriever from a Labrador Retriever?

They’re both beloved family dogs with similar size and temperament. But experts can easily tell them apart by observing:

Golden Retrievers have long, dense coats ranging from golden to dark gold, more elongated heads, and distinct feathering around ears, legs, and tails.
Labradors, on the other hand, have short, double-layered coats, more compact bodies, rounder heads, and thick otter-like tails. Their coats come in yellow, chocolate, or black.

Interestingly, for humans, this distinction is relatively easy, “long hair vs. short hair” might be all you need.

But for AI, relying solely on coat length (a texture-based feature) is often unreliable. Lighting, image quality, or even a trimmed Golden Retriever can confuse the model.

When analyzing this challenge, we can see…

The problem with using only CNNs:

While CNNs can detect individual features like “coat length” or “tail shape,” they struggle with combinations like “head shape + fur type + body structure.” This issue worsens when the dog is in a different pose.

The problem with using only Transformers:

Transformers can associate features across the image, but they’re not great at picking up fine-grained cues like slight variations in fur texture or subtle head contours. They also require large datasets to achieve expert-level performance.
Plus, their computational cost increases sharply with image resolution, slowing down training.

These limitations highlight a core truth:

Fine-grained visual recognition requires both local detail extraction and global relationship modeling.

A truly expert system like a veterinarian or show judge must inspect features up close while understanding the overall structure. That’s exactly where hybrid architectures shine.

1.4 The Advantages of a Hybrid Architecture

This is why we need hybrid systems architectures that combine CNNs’ precision in local features with Transformers’ ability to model global relationships:

CNNs: Extract local, fine-grained features like fur texture and ear shape, crucial for spotting subtle differences.
Transformers: Capture long-range dependencies (e.g., head shape + body size + eye color), allowing the model to reason holistically.
Morphological Feature Extractors: Mimic human expert judgment by emphasizing diagnostic features, bridging the gap left by data-driven models.

Such an architecture not only boosts evaluation metrics like the F1 Score, but more importantly, it enables the AI to genuinely understand the subtle distinctions between breeds, getting closer to the way human experts think. The model learns to weigh multiple features together, instead of over-relying on one or two unstable cues.

In the next section, I’ll dive into how I actually built this hybrid architecture, especially how I selected and integrated the right components.

2. Why I Chose ConvNextV2: Key Innovations Behind the Backbone

Among the many visual recognition architectures available, why did I choose ConvNextV2 as the backbone of my project?

Because its design effectively combines the best of both worlds: the CNN’s ability to extract precise local features, and the Transformer’s strength in capturing long-range dependencies.

Let’s break down three core innovations that made it the right fit.

2.1 FCMAE Self-Supervised Learning: Adaptive Learning Inspired by the Human Brain

Imagine learning to navigate with your eyes covered, your brain becomes laser-focused on memorizing the details you can perceive.

ConvNextV2 uses a self-supervised pretraining strategy similar to that of Vision Transformers.

During training, up to 60% of input pixels are intentionally masked, and the model must learn to reconstruct the missing regions.
This “make learning harder on purpose” approach actually leads to three major benefits:

Comprehensive Feature Learning
The model learns the underlying structure and patterns of an image—not just the most obvious visual cues.
In the context of breed classification, this means it pays attention to fur texture, skeletal structure, and body proportions, instead of relying solely on color or shape.
Reduced Dependence on Labeled Data
By pretraining on unlabeled dog images, the model develops strong visual representations.
Later, with just a small amount of labeled data, it can fine-tune effectively—saving significant annotation effort.
Improved Recognition of Rare Patterns
The reconstruction task pushes the model to learn generalized visual rules, enhancing its ability to identify rare or underrepresented breeds.

2.2 GRN Global Calibration: Mimicking an Expert’s Attention

Like a seasoned photographer who adjusts the exposure of each element to highlight what truly matters.

GRN (Global Response Normalization) is arguably the most impactful innovation in ConvNextV2, giving CNNs a degree of global awareness that was previously lacking:

Dynamic Feature Recalibration
GRN globally normalizes the feature map, amplifying the most discriminative signals while suppressing irrelevant ones.
For instance, when identifying a German Shepherd, it emphasizes upright ears and the sloped back while minimizing background noise.
Enhanced Sensitivity to Subtle Differences
This normalization sharpens feature contrast, making it easier to spot fine-grained differences—critical for telling apart breeds like the Siberian Husky and Alaskan Malamute.
Focus on Diagnostic Features
GRN helps the model prioritize features that truly matter for classification, rather than relying on statistically correlated but causally irrelevant cues.

2.3 Sparse and Efficient Convolutions: More with Less

Like a streamlined team where each member plays to their strengths, reducing redundancy while boosting performance.

ConvNextV2 incorporates architectural optimizations such as depthwise separable convolutions and sparse connections, resulting in three major gains:

Improved Computational Efficiency
By breaking down convolutions into smaller, more efficient steps, the model reduces its computational load.
This allows it to process high-resolution dog images and detect fine visual differences without requiring excessive resources.
Expanded Effective Receptive Field
The layout of convolutions is designed to extend the model’s field of view, helping it analyze both overall body structure and local details simultaneously.
Parameter Efficiency
The architecture ensures that each parameter carries more learning capacity, extracting richer, more nuanced information using the same amount of compute.

2.4 Why ConvNextV2 Was the Right Fit for a Hybrid Architecture

ConvNextV2 turned out to be the perfect backbone for this hybrid system, not just because of its performance, but because it embodies the very philosophy of fusion.

It retains the local precision of CNNs while adopting key design concepts from Transformers to expand its global awareness. This duality makes it a natural bridge between CNNs and Transformers apable of preserving fine-grained details while understanding the broader context.

It also lays the groundwork for additional modules like multi-head attention and morphological feature extractors, ensuring the model starts with a complete, balanced feature set.

In short, ConvNextV2 doesn’t just “see the parts”, it starts to understand how the parts come together. And in a task like dog breed classification, where both minute differences and overall structure matter, this kind of foundation is what transforms an ordinary model into one that can reason like an expert.

3. Technical Implementation of the MultiHeadAttention Mechanism

In neural networks, the core concept of the attention mechanism is to enable models to “focus” on key parts of the input, similar to how human experts consciously focus on specific features (such as ear shape, muzzle length, tail posture) when identifying dog breeds.
The Multi-Head Attention (MHA) mechanism further enhances this ability:

“Rather than having one expert evaluate all features, it’s better to form a panel of experts, letting each focus on different details, and then synthesize a final judgment!”

Mathematically, MHA uses multiple linear projections to allow the model to simultaneously learn different feature associations, further enhancing performance.

3.1 Understanding MultiHeadAttention from a Mathematical Perspective

The core idea of MultiHeadAttention is to use multiple different projections to allow the model to simultaneously attend to patterns in different subspaces. Mathematically, it first projects input features into three roles: Query, Key, and Value, then calculates the similarity between Query (Q) and Key (K), and uses this similarity to perform weighted averaging of Values.

The basic formula can be expressed as:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

3.2 Application of Einstein Summation Convention in Attention Calculation

In the implementation, I used the torch.einsum function based on the Einstein summation convention to efficiently calculate attention scores:

energy = torch.einsum("nqd,nkd->nqk", [q, k])

This means:
q has shape (batch_size, num_heads, query_dim)
k has shape (batch_size, num_heads, key_dim)
The dot product is performed on dimension d, resulting in (batch_size, num_heads, query_len, key_len) This is essentially “calculating similarity between each Query and all Keys,” generating an attention weight matrix

3.3 Implementation Code Analysis

Key implementation code for MultiHeadAttention:

def forward(self, x):

    N = x.shape[0]  # batch size

    # 1. Project input, prepare for multi-head attention calculation
    x = self.fc_in(x)  # (N, input_dim) → (N, scaled_dim)

    # 2. Calculate Query, Key, Value, and reshape into multi-head form
    q = self.query(x).view(N, self.num_heads, self.head_dim)  # query
    k = self.key(x).view(N, self.num_heads, self.head_dim)    # key
    v = self.value(x).view(N, self.num_heads, self.head_dim)  # value

    # 3. Calculate attention scores (similarity matrix)
    energy = torch.einsum("nqd,nkd->nqk", [q, k])

    # 4. Apply softmax (normalize weights) and perform scaling
    attention = F.softmax(energy / (self.head_dim ** 0.5), dim=2)

    # 5. Use attention weights to perform weighted sum on Value
    out = torch.einsum("nqk,nvd->nqd", [attention, v])

    # 6. Rearrange output and pass through final linear layer
    out = out.reshape(N, self.scaled_dim)
    out = self.fc_out(out)

    return out

3.3.1. Steps 1-2: Projection and Multi-Head Splitting
First, input features are projected through a linear layer, and then separately projected into query, key, and value spaces. Importantly, these projections not only change the feature representation but also split them into multiple “heads,” each attending to different feature subspaces.

3.3.2. Steps 3-4: Attention Calculation

3.3.3. Steps 5-6: Weighted Aggregation and Output Projection
Using the calculated attention weights, weighted summation is performed on the value vectors to obtain the attended feature representation. Finally, outputs from all heads are concatenated and passed through an output projection layer to get the final result.

This implementation has the following simplifications and adjustments compared to standard Transformer MultiHeadAttention: Query, key, and value come from the same input (self-attention), suitable for processing features obtained from CNN backbone networks.

It uses einsum operations to simplify matrix calculations.

The design of projection layers ensures dimensional consistency, facilitating integration with other modules.

3.4 How Attention Mechanisms Enhance Understanding of Morphological Feature Relationships

The multi-head attention mechanism brings three core advantages to dog breed recognition:

3.4.1. Feature Relationship Modeling

Just as a professional veterinarian not only sees that ears are upright but also notices how this combines with tail curl degree and skull shape to form a dog breed’s “feature combination.”

It can establish associations between different morphological features, capturing their synergistic relationships, not just seeing “what features exist” but observing “how these features combine.”

Application: The model can learn that a combination of “pointed ears + curled tail + medium build” points to specific Northern dog breeds.

3.4.2. Dynamic Feature Importance Assessment

Just as experts know to focus particularly on fur texture when identifying Poodles, while focusing mainly on the distinctive nose and head structure when identifying Bulldogs.

It dynamically adjusts focus on different features based on the specific content of the input.

Key features vary across different breeds, and the attention mechanism can adaptively focus.

Application: When seeing a Border Collie, the model might focus more on fur color distribution; when seeing a Dachshund, it might focus more on body proportions

3.4.3. Complementary Information Integration

Like a team of experts with different specializations, one focusing on skeletal structure, another on fur features, another analyzing behavioral posture, making a more comprehensive judgment together.

Through multiple attention heads, each simultaneously captures different types of feature relationships. Each head can focus on a specific type of feature or relationship pattern.

Application: One head might primarily focus on color patterns, another on body proportions, and yet another on facial features, ultimately synthesizing these perspectives to make a judgment.

By combining these three capabilities, the MultiHeadAttention mechanism goes beyond identifying individual features, it learns to model the complex relationships between them, capturing subtle patterns that emerge from their combinations and enabling more accurate recognition.

4. Implementation Details of the Hybrid Architecture

4.1 The Overall Architectural Flow

When designing this hybrid architecture, my goal was simple yet ambitious:

Let each component do what it does best, and build a complementary system where they enhance one another.

Much like a well-orchestrated symphony, each instrument (or module) plays its role, only together can they create harmony.
In this setup:

The CNN focuses on capturing local details.
The morphological feature extractor enhances key structural features.
The multi-head attention module learns how these features interact.

As shown in the diagram above, the overall model operates through five key stages:

4.1.1. Feature Extraction

Once an image enters the model, ConvNextV2 takes charge of extracting foundational features, such as fur color, contours, and texture. This is where the AI begins to “see” the basic shape and appearance of the dog.

4.1.2. Morphological Feature Enhancement

These initial features are then refined by the morphological feature extractor. This module functions like an expert’s eye—highlighting structural characteristics such as ear shape and body proportions. Here, the AI learns to focus on what actually matters.

4.1.3. Feature Fusion

Next comes the feature fusion layer, which merges the local features with the enhanced morphological cues. But this isn’t just a simple concatenation, the layer also models how these features interact, ensuring the AI doesn’t treat them in isolation, but rather understands how they combine to convey meaning.

4.1.4. Feature Relationship Modeling

The fused features are passed into the multi-head attention module, which builds contextual relationships between different attributes. The model begins to understand combinations like “ear shape + fur texture + facial proportions” rather than looking at each trait independently.

4.1.5. Final Classification

After all these layers of processing, the model moves to its final classifier, where it makes a prediction about the dog’s breed, based on the rich, integrated understanding it has developed.

4.2 Integrating ConvNextV2 and Parameter Setup

For implementation, I chose the pretrained ConvNextV2-base model as the backbone:

self.backbone = timm.create_model(
    'convnextv2_base',
    pretrained=True,
    num_classes=0)  # Use only the feature extractor; remove original classification head

Depending on the input image size or backbone architecture, the feature output dimensions may vary. To build a robust and flexible system, I designed a dynamic feature dimension detection mechanism:

with torch.no_grad():
    dummy_input = torch.randn(1, 3, 224, 224)
    features = self.backbone(dummy_input)
    if len(features.shape) > 2:
        features = features.mean([-2, -1])  # Global average pooling to produce a 1D feature vector
    self.feature_dim = features.shape[1]

This ensures the system automatically adapts to any feature shape changes, keeping all downstream components functioning properly.

4.3 Intelligent Configuration of the Multi-Head Attention Layer

As mentioned earlier, I experimented with several head counts. Too many heads increased computation and risked overfitting. I ultimately settled on eight, but allowed the number of heads to adjust automatically based on feature dimensions:

self.num_heads = max(1, min(8, self.feature_dim // 64))
self.attention = MultiHeadAttention(self.feature_dim, num_heads=self.num_heads)

4.4 Making CNN, Transformers, and Morphological Features Work Together

The morphological feature extractor works hand-in-hand with the attention mechanism.

While the former provides structured representations of key traits, the latter models relationships among these features:

# Feature fusion
combined_features = torch.cat([
    features,  # Base features
    morphological_features,  # Morphological features
    features * morphological_features  # Interaction between features
], dim=1)
fused_features = self.feature_fusion(combined_features)

# Apply attention
attended_features = self.attention(fused_features)

# Final classification
logits = self.classifier(attended_features)

return logits, attended_features

A special note about the third component features * morphological_features — this isn’t just a mathematical multiplication. It creates a form of dialogue between the two feature sets, allowing them to influence each other and generate richer representations.

For example, suppose the model picks up “pointy ears” from the base features, while the morphological module detects a “small head-to-body ratio.”

Individually, these may not be conclusive, but their interaction may strongly suggest a specific breed, like a Corgi or Finnish Spitz. It’s no longer just about recognizing ears or head size, the model learns to interpret how features work together, much like an expert would.
This full pipeline from feature extraction, through morphological enhancement and attention-driven modeling, to prediction is my vision of what an ideal architecture should look like.

The design has several key advantages:

The morphological extractor brings structured, expert-inspired understanding.
The multi-head attention uncovers contextual relationships between traits.
The feature fusion layer captures nonlinear interactions through element-wise multiplication.

4.5 Technical Challenges and How I Solved Them

Building a hybrid architecture like this was far from smooth sailing.
Here are several challenges I faced and how solving them helped me improve the overall design:

4.5.1. Mismatched Feature Dimensions

Challenge: Output sizes varied across modules, especially when switching backbone networks.
Solution: In addition to the dynamic dimension detection mentioned earlier, I implemented adaptive projection layers to unify the feature dimensions.

4.5.2. Balancing Performance and Efficiency

Challenge: More complexity meant more computation.
Solution: I dynamically adjusted the number of attention heads, and used efficient einsum operations to optimize performance.

4.5.3. Overfitting Risk

Challenge: Hybrid models are more prone to overfitting, especially with smaller training sets.
Solution: I applied LayerNorm, Dropout, and weight decay for regularization.

4.5.4. Gradient Flow Issues

Challenge: Deep architectures often suffer from vanishing or exploding gradients.
Solution: I introduced residual connections to ensure gradients flow smoothly during both forward and backward passes.

If you’re interested in exploring the full implementation, feel free to check out the GitHub project here.

5. Performance Evaluation and Heatmap Analysis

The value of a hybrid architecture lies not only in its quantitative performance but also in how it qualitatively “thinks.”

In this section, we’ll use confidence score statistics and heatmap analysis to demonstrate how the model evolved from CNN → CNN+Transformer → CNN+Transformer+MFE, and how each stage brought its visual reasoning closer to that of a human expert.

To ensure that the performance differences came purely from architecture design, I retrained each model using the exact same dataset, augmentation methods, loss function, and training parameters. The only variation was the presence or absence of the Transformer and morphological modules.

In terms of F1 score, the CNN-only model reached 87.83%, the CNN+Transformer variant performed slightly better at 89.48%, and the final hybrid model scored 88.70%. While the transformer-only version showed the highest score on paper, it didn’t always translate into more reliable predictions. In fact, the hybrid model was more consistent in practice and handled similar-looking or blurry cases more reliably.

5.1 Confidence Scores and Statistical Insights

I tested 17 images of Border Collies, including standard photos, artistic illustrations, and various camera angles, to thoroughly assess the three architectures.

While other breeds were also included in the broader evaluation, I chose Border Collie as a representative case due to its distinctive features and frequent confusion with similar breeds.

Figure 1: Model Confidence Score Comparison
As shown above, there are clear performance differences across the three models.

A notable example is Sample #3, where the CNN-only model misclassified the Border Collie as a Collie, with a low confidence score of 0.2492.

While the CNN+Transformer corrected this error, it introduced a new one in Sample #5, misidentifying it as a Shiba Inu with 0.2305 confidence.

The final CNN+Transformer+MFE model correctly identified all samples without error. What’s interesting here is that both misclassifications occurred at low confidence levels (below 0.25).
This suggests that even when the model makes a mistake, it retains a sense of uncertainty—a desirable trait in real world applications. We want models to be cautious when unsure, rather than confidently wrong.

Figure 2: Confidence Score Distribution
Looking at the distribution of confidence scores, the improvement becomes even more evident.

The CNN-only model mostly predicted in the 0.4–0.5 range, with few samples reaching beyond 0.6.

CNN+Transformer showed better concentration around 0.5–0.6, but still had only one sample in the 0.7–0.8 high-confidence range.
The CNN+Transformer+MFE model stood out with 6 samples reaching the 0.7–0.8 confidence level.

This rightward shift in distribution reveals more than just accuracy, it reflects certainty.

The model is evolving from “barely correct” to “confidently correct,” which significantly enhances its reliability in real-world deployment.

Figure 3: Statistical Summary of Model Performance
A deeper statistical breakdown highlights consistent improvements:

Mean confidence score rose from 0.4639 (CNN) to 0.5245 (CNN+Transformer), and finally 0.6122 with the full hybrid setup—a 31.9% increase overall.

Median score jumped from 0.4665 to 0.6827, confirming the overall shift toward higher confidence.

The proportion of high-confidence predictions (≥ 0.5) also showed striking gains:

CNN: 41.18%
CNN+Transformer: 64.71%
CNN+Transformer+MFE: 82.35%

This means that with the final architecture, most predictions are not only correct but confidently correct.

You might notice a slight increase in standard deviation (from 0.1237 to 0.1616), which might seem like a negative at first. But in reality, this reflects a more nuanced response to input complexity:

The model is highly confident on easier samples, and appropriately cautious on harder ones. The improvement in maximum confidence value (from 0.6343 to 0.7746) further shows how this hybrid architecture can make more decisive and assured judgments when presented with straightforward samples.

5.2 Heatmap Analysis: Tracing the Evolution of Model Reasoning

While statistical metrics are helpful, they don’t tell the full story.
To truly understand how the model makes decisions, we need to see what it sees and heatmaps make this possible.

In these heatmaps, red indicates areas of high attention, highlighting the regions the model relies on most during prediction. By analyzing these attention maps, we can observe how each model interprets visual information, revealing fundamental differences in their reasoning styles.

Let’s walk through one representative case.

5.2.1 Frontal View of a Border Collie: From Local Eye Focus to Structured Morphological Understanding
When presented with a frontal image of a Border Collie, the three models reveal distinct attention patterns, reflecting how their architectural designs shape visual understanding.

The CNN-only model produces a heatmap with two sharp attention peaks, both centered on the dog’s eyes. This indicates a strong reliance on local features while overlooking other morphological traits like the ears or facial outline. While eyes are indeed important, focusing solely on them makes the model more vulnerable to variations in pose or lighting. The resulting confidence score of 0.5581 reflects this limitation.

With the CNN+Transformer model, the attention becomes more distributed. The heatmap forms a loose M-shaped pattern, extending beyond the eyes to include the forehead and the space between the eyes. This shift suggests that the model begins to understand spatial relationships between features, not just the features themselves. This added contextual awareness leads to a stronger confidence score of 0.6559.

The CNN+Transformer+MFE model shows the most structured and comprehensive attention map. The heat is symmetrically distributed across the eyes, ears, and the broader facial region. This indicates that the model has moved beyond feature detection and is now capturing how features are arranged as part of a meaningful whole. The Morphological Feature Extractor plays a key role here, helping the model grasp the structural signature of the breed. This deeper understanding boosts the confidence to 0.6972.

Together, these three heatmaps represent a clear progression in visual reasoning, from isolated feature detection, to inter-feature context, and finally to structural interpretation. Even though ConvNeXtV2 is already a powerful backbone, adding Transformer and MFE modules enables the model to not just see features but to understand them as part of a coherent morphological pattern. This shift is subtle but crucial, especially for fine-grained tasks like breed classification.

5.2.2 Error Case Analysis: From Misclassification to True Understanding

This is a case where the CNN-only model misclassified a Border Collie.

Looking at the heatmap, we can see why. The model focuses almost entirely on a single eye, ignoring most of the face. This kind of over-reliance on one local feature makes it easy to confuse breeds that share similar traits in this case, a Collie, which also has similar eye shape and color contrast.

What the model misses are the broader facial proportions and structural details that define a Border Collie. Its low confidence score of 0.2492 reflects that uncertainty.

With the CNN+Transformer model, attention shifts in a more promising direction. It now covers both eyes and parts of the forehead, creating a more balanced attention pattern. This suggests the model is beginning to connect multiple features, rather than depending on just one.

Thanks to self-attention, it can better interpret relationships between facial components, leading to the correct prediction — Border Collie. The confidence score rises to 0.5484, more than double the previous model’s.

The CNN+Transformer+MFE model takes this further by improving morphological awareness. The heatmap now extends to the nose and muzzle, capturing nuanced traits like facial length and mouth shape. These are subtle but important cues that help distinguish herding breeds from one another.

The MFE module seems to guide the model toward structural combinations, not just isolated features. As a result, confidence increases again to 0.5693, showing a more stable, breed-specific understanding.

This progression from a narrow focus on a single eye, to integrating facial traits, and finally to interpreting structural morphology, highlights how hybrid models support more accurate and generalizable visual reasoning.

In this example, the CNN-only model focuses almost entirely on one side of the dog’s face. The rest of the image is nearly ignored. This kind of narrow attention suggests the model didn’t have enough visual context to make a strong decision. It guessed correctly this time, but with a low confidence score of 0.2238, it’s clear that the prediction wasn’t based on solid reasoning.

The CNN+Transformer model shows a broader attention span, but it introduces a different issue, the heatmap becomes scattered. You can even spot a strong attention spike on the far right, completely unrelated to the dog. This kind of misplaced focus likely led to a misclassification as a Shiba Inu, and the confidence score was still low at 0.2305.

This highlights an important point:

Adding a Transformer doesn’t guarantee better judgment unless the model learns where to look. Without guidance, self-attention can amplify the wrong signals and create confusion rather than clarity.

With the CNN+Transformer+MFE model, the attention becomes more focused and structured. The model now looks at key regions like the eyes, nose, and chest, building a more meaningful understanding of the image. But even here, the confidence remains low at 0.1835, despite the correct prediction. This image clearly presented a real challenge for all three models.

That’s what makes this case so interesting.

It reminds us that a correct prediction doesn’t always mean the model was confident. In harder scenarios unusual poses, subtle features, cluttered backgrounds even the most advanced models can hesitate.

And that’s where confidence scores become invaluable.
They help flag uncertain cases, making it easier to design review pipelines where human experts can step in and verify tricky predictions.

5.2.3 Recognizing Artistic Renderings: Testing the Limits of Generalization

Artistic images pose a unique challenge for visual recognition systems. Unlike standard photos with crisp textures and clear lighting, painted artworks are often abstract and distorted. This forces models to rely less on superficial cues and more on deeper, structural understanding. In that sense, they serve as a perfect stress test for generalization.

Let’s see how the three models handle this scenario.

Starting with the CNN-only model, the attention map is scattered, with focus diffused across both sides of the image. There’s no clear structure — just a vague attempt to “see everything,” which usually means the model is unsure what to focus on. That uncertainty is reflected in its confidence score of 0.5394, sitting in the lower-mid range. The model makes the correct guess, but it’s far from confident.

Next, the CNN+Transformer model shows a clear improvement. Its attention sharpens and clusters around more meaningful regions, particularly near the eyes and ears. Even with the stylized brushstrokes, the model seems to infer, “this could be an ear” or “that looks like the facial outline.” It’s starting to map anatomical cues, not just visual textures. The confidence score rises to 0.6977, suggesting a more structured understanding is taking shape.

Finally, we look at the CNN+Transformer+MFE hybrid model. This one locks in with precision. The heatmap centers tightly on the intersection of the eyes and nose — arguably the most distinctive and stable region for identifying a Border Collie, even in abstract form. It’s no longer guessing based on appearance. It’s reading the dog’s underlying structure.

This leap is largely thanks to the MFE, which helps the model focus on features that persist, even when style or detail varies. The result? A confident score of 0.7457, the highest among all three.

This experiment makes something clear:

Hybrid models don’t just get better at recognition, they get better at reasoning.

They learn to look past visual noise and focus on what matters most: structure, proportion, and pattern. And that’s what makes them reliable, especially in the unpredictable, messy real world of images.

Conclusion

As deep learning evolves, we’ve moved from CNNs to Transformers—and now toward hybrid architectures that combine the best of both. This shift reflects a broader change in AI design philosophy: from seeking purity to embracing fusion.

Think of it like cooking. Great chefs don’t insist on one technique. They mix sautéing, boiling, and frying depending on the ingredient. Similarly, hybrid models combine different architectural “flavors” to suit the task at hand.

This fusion design offers several key benefits:

Complementary strengths: Like combining a microscope and a telescope, hybrid models capture both fine details and global context.
Structured understanding: Morphological feature extractors bring expert-level domain insights, allowing models not just to see, but to truly understand.
Dynamic adaptability: Future models might adjust internal attention patterns based on the image, emphasizing texture for spotted breeds, or structure for solid-colored ones.
Wider applicability: From medical imaging to biodiversity and art authentication, any task involving fine-grained visual distinctions can benefit from this approach.

This visual system—blending ConvNeXtV2, attention mechanisms, and morphological reasoning proves that accuracy and intelligence don’t come from any single architecture, but from the right combination of ideas.

Perhaps the future of AI won’t rely on one perfect design, but on learning to combine cognitive strategies just as the human brain does.

References & Data Source

Research References

Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
Dosovitskiy, A., et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Liu, Z., et al. (2022). ConvNeXt: A ConvNet for the 2020s. CVPR 2022
Liu, Z., et al. (2023). ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. CVPR 2023.
Rockt (2018). Einstein Summation Notation Explained Visually. rockt.github.io
Pytorch Org. torch.einsum

Dataset Sources

Stanford Dogs Dataset – Kaggle Dataset
Originally sourced from Stanford Vision Lab – ImageNet Dogs License: Non-commercial research and educational use only Citation: Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for Fine-Grained Image Categorization. FGVC Workshop, CVPR, 2011
Unsplash Images – Additional images of four breeds (Bichon Frise, Dachshund, Shiba Inu, Havanese) were sourced from Unsplash for dataset augmentation.

Thank you for reading. Through developing PawMatchAI, I’ve learned many valuable lessons about AI vision systems and feature recognition. If you have any perspectives or topics you’d like to discuss, I welcome the opportunity to exchange ideas.
Email
GitHub

Disclaimer

The methods and approaches described in this article are based on my personal research and experimental findings. While the Hybrid Architecture has demonstrated improvements in specific scenarios, its performance may vary depending on datasets, implementation details, and training conditions.

This article is intended for educational and informational purposes only. Readers should conduct independent evaluations and adapt the approach based on their specific use cases. No guarantees are made regarding its effectiveness across all applications.

The post The Art of Hybrid Architectures appeared first on Towards Data Science.

The Ultimate AI/ML Roadmap For Beginners

Egor Howell — Wed, 26 Mar 2025 04:10:12 +0000

AI is transforming the way businesses operate, and nearly every company is exploring how to leverage this technology.

As a result, the demand for AI and machine learning skills has skyrocketed in recent years.

With nearly four years of experience in AI/ML, I’ve decided to create the ultimate guide to help you enter this rapidly growing field.

Why work in AI/ML?

It’s no secret that AI and machine learning are some of the most desired technologies nowadays.

Being well-versed in these fields will open many career opportunities going forward, not to mention that you will be at the forefront of scientific advancement.

And to be blunt, you will be paid a lot.

According to Levelsfyi, the median salary for a machine learning engineer is £93k, and for an AI engineer is £75k. Whereas for a data scientist, it is £70k, and software engineer is £83k.

Don’t get me wrong; these are super high salaries on their own, but AI/ML will give you that edge, and the difference will likely grow more prominent in the future.

You also don’t need a PhD in computer science, maths, or physics to work on AI/ML. Good engineering and problem-solving skills, along with a good understanding of the fundamental ML concepts, are enough.

Most jobs are not research jobs but more implementing AI/ML solutions to real-life problems.

For example, I work as a machine learning engineer, but I don’t do research. I aim to use algorithms and apply them to business problems to benefit the customers and, thus, the company.

Below are jobs that use AI/ML:

Machine Learning Engineer
AI Engineer
Research Scientist
Research Engineer
Data Scientist
Software Engineer (AI/ML focus)
Data Engineer (AI/ML focus)
Machine Learning Platform Engineer
Applied Scientist

They all have different requirements and skills, so there will be something that suits you well.

If you want to learn more about the roles above, I recommend reading some of my previous articles.

The Difference Between ML Engineers and Data Scientists

Should You Become A Data Scientist, Data Analyst Or Data Engineer?
Explaining the differences and requirements between the various data rolesmedium.com

Right, let’s now get into the roadmap!

Maths

I’d argue that solid mathematics skills are probably the most essential for any tech professional, especially if you are working with AI/ML.

You need a good grounding to understand how AI and ML models work under the hood. This will help you better debug them and develop intuition about how to work with them.

Don’t get me wrong; you don’t need a PhD in quantum physics, but you should be knowledgeable in the following three areas.

Linear Algebra — to understand how matrices, eigenvalues and vectors work, which are used everywhere in AI and machine learning.
Calculus — to understand how AI actually learns using algorithms like gradient descent and backpropagation that utilise differentiation and integration.
Statistics — to understand the probabilistic nature of machine learning models through learning probability distributions, statistical inference and Bayesian statistics.

Resources:

Practical Statistics for Data Science (affiliate link) — Great applied book with coding examples in Python.
Mathematics for Machine Learning (affiliate link)— Best all-round book, but quite dense.
Essence of Linear Algebra (3Blue1Brown) — Fantastic visual explanations
Brilliant & Khan Academy — Wide range of information on all topics.

This is pretty much all you need; if anything, it’s slightly overkill in some aspects!

Timeline: Depending on background, this should take you a couple/few months to get up to speed.

I have in-depth breakdowns of the maths you need for Data Science, which is equally applicable here for AI/ML.

How to Learn the Math Needed for Data Science

Python

Python is the gold standard and the go-to programming language for machine learning and AI.

Beginners often get caught up in the so-called “best way” to learn Python. Any introductory course will suffice, as they teach the same things.

The main things you want to learn are:

Native data structures (dictionaries, lists, sets, and tuples)
For and while loops
If-else conditional statements
Functions and classes

You also want to learn specific scientific computing libraries such as:

NumPy — Numerical computing and arrays.
Pandas — Data manipulation and analysis.
Matplotlib & Plotly — Data visualization.
scikit-learn — Implementing classical ML algorithms.

Resources:

W3Schools Python Course (free) — Great free resource.
Python for Everybody Specialization (Coursera) — Probably the most popular course on Python.
Machine Learning with Python and Scikit-Learn — Full Course — A video course on implementing basic ML algorithms using Python.

Timeline: Again, depending on your background, this should take a couple of months. If you know Python already, it will be a lot quicker.

Data structures and algorithms

This one may seem slightly out of place, but if you want to be a machine learning or AI engineer, you must know data structures and algorithms.

This is not only for interviews; it is also used in AI/ML algorithms. You will come across things like backtracking, depth-first search, and binary trees more than you think.

The things to learn are:

Arrays & Linked Lists
Trees & Graphs
HashMaps, Queues & Stacks
Sorting & Searching Algorithms
Dynamic Programming

Resources:

Neetcode.io — Great introductory, intermediate and advanced data structure and algorithm courses.
Leetcode & Hackerrank — Platforms to practise.

Timeline: Around a month to nail the basics.

Machine learning

This is where the fun begins!

The previous four steps involved getting your foundation ready to tackle machine learning.

In general, machine learning falls into two categories:

Supervised learning — where we have target labels to train the model.
Unsupervised learning — when there are no target labels.

The diagram below illustrates this split and some algorithms in each category.

Diagram by author.

The key algorithms and concepts you should learn are:

Linear, logistic and polynomial regression.
Decision trees, random forests and gradient-boosted trees.
Support vector machines.
K-means and K-nearest neighbour clustering.
Feature engineering.
Evaluation metrics.
Regularisation, bias vs variance tradeoff and cross-validation.

Resources:

Machine Learning Specialisation by Andrew Ng — This is my first ML course, and I think it is probably the best one out there.
The Hundred-Page ML Book (affiliate link)— Concise with practical insights into building ML models.
Hands-On ML with Scikit-Learn, Keras, and TensorFlow (affiliate link) — If I had to give only one book to learn machine learning, this would be it!
The Elements of Statistical Learning (affiliate link)— Excellent for mastering machine learning fundamentals, basically statistical learning.

Timeline: This section is quite dense, so it will likely take roughly ~3 months to know most of this information. In reality, it will take years to truly master everything in those resources.

AI and deep learning

There has been a lot of hype around AI since ChatGPT was released in 2022.

However, AI itself has been around as a concept for a long time, dating back in its current form to the 1950s, when the neural network originated.

The AI we refer to at the moment is specifically called generative AI (GenAI), which is actually quite a small subset of the whole AI eco-system as shown below.

Image by author.

As its name suggests, GenAI is an algorithm that generates text, images, audio, and even code.

Until recently, the AI landscape was dominated by two main models:

Convolutional Neural Networks (CNNs) — These were used for computer vision tasks such as recognising and categorising images.
Recurrent Neural Networks (RNNs) — These were used for sequence-based data like time series and natural language.

However, in 2017, a paper called “Attention Is All You Need” was published, introducing the transformer architecture and model, which has since superseded CNNs and RNNs.

Today, transformers are the backbone of large language models (LLMs) and unequivocally rule the AI landscape.

With all this in mind, the things you should know are:

Neural Networks — The algorithm that really puts AI/ML on the map.
Convolutional and Recurrent Neural Networks — Still used today quite a bit for their specific tasks.
Transformers — The current state of the art.
RAG, Vector Databases, LLM Fine Tuning — These technologies and concepts are crucial to the current AI infrastructure.
Reinforcement Learning — The third type of learning used to create AI like AlphaGO.

Resources:

Deep Learning Specialization by Andrew Ng. — This is the follow-on course from the Machine Learning SpecialiSation and will teach all you need to know about Deep Learning, CNNs, and RNNs.
Introduction to LLMs by Andrej Karpathy (former senior director of AI at Tesla) — learn more about LLMs and how they are trained.
Neural Networks: Zero to Hero — Starts relatively slow, building a neural network from scratch. However, in the last video, he gets you building your own Generative Pre-trained Transformers (GPT)!
Reinforcement Learning Course — Lectures by David Silver, a lead researcher at DeepMind.

Timeline: There is a lot here and it’s call quite hard and cutting edge stuff. So around 3 months is probably what it will take you.

MLOps

A model in a Jupyter Notebook has no value, as I have said many times.

For your AI/ML models to be useful, you must learn how to deploy them to production.

Areas to learn are:

Cloud technologies like AWS, GCP or Azure.
Docker and Kubernetes.
How to write production code.
Git, CircleCI, Bash/Zsh.

Resources:

Practical MLOps (affiliate link) — This is probably the only book you need to understand how to deploy your machine-learning model. I use it more as a reference text, but it teaches almost everything you need to know.
Designing Machine Learning Systems (affiliate link) — Another great book and resource to vary your information source.

Research papers

AI is evolving rapidly, so it’s worth staying up to date with all the latest developments.

ML papers of the week — A newsletter and Twitter account that sends out the most significant AI papers published that week, along with their key links, every week.
ArXiv — The de-facto place to find research papers.

Some papers I recommend you read are:

Attention Is All You Need — This is the original paper introducing Transformers, which power models like ChatGPT, BERT, and GPT-4.
Mastering the game of Go with deep neural networks and tree search — DeepMind’s paper on how they created an AI that beat the world’s best GO player.
DeepSeek R1: Incentivizing Reasoning Capability in LLMs — Recent work on improving logical reasoning in large language models.
BERT: Pre-training of Deep Bidirectional Transformers — A deep dive into BERT, one of the first self-supervised NLP models that improved contextual understanding.
Highly accurate protein structure prediction with AlphaFold — DeepMind solving protein folding using AI, a massive issue in healthcare that AI has helped to advance!

You can find a comprehensive list here.

Conclusion

Breaking into AI/ML may seem overwhelming, but it’s all about taking it one step at a time.

Learn the basics like Python, maths and data structures and algorithms.
Get your AI/ML knowledge learning supervised learning, neural networks and transformers.
Learn how to deploy AI algorithms.

The space is ginormous, so it will probably take you about a year to fully grasp everything in this roadmap, and that’s fine. There are literally bachelor’s degrees dedicated to this space, which take three years,

Just go at your own pace, and eventually, you will get to where you want to be.

Happy learning!

Another thing!

Join my free newsletter, Dishing the Data, where I share weekly tips, insights, and advice from my experience as a practicing data scientist. Plus, as a subscriber, you’ll get my FREE Data Science Resume Template!

Dishing The Data | Egor Howell | Substack
Advice and learnings on data science, tech and entrepreneurship. Click to read Dishing The Data, by Egor Howell, a…newsletter.egorhowell.com

Connect with me

The post The Ultimate AI/ML Roadmap For Beginners appeared first on Towards Data Science.

What Do Machine Learning Engineers Do?

Egor Howell — Tue, 25 Mar 2025 07:45:20 +0000

In this article, I want to explain precisely what I do as a machine learning engineer.

The aim is to help anyone looking to enter the field gain a truthful view of what a machine learning engineer is, how we work, what we do, and what a typical day in life is like.

I hope it can help you pinpoint if a career in machine learning is indeed for you.

What is a machine learning engineer?

Due to the rapid acceleration of the tech/AI space, a machine learning engineer is still not well-defined and varies between companies and geographies to a certain extent.

However, it generally refers to someone who:

Mixes machine learning, statistics and software engineering skills to train and deploy models into production.

At some companies, there will be a large cross-over with data scientists. Still, the main distinction between the two roles is that machine learning engineers deliver the solution into production. Often, data scientists won’t do this and focus more on helping in the model-building stage.

The need for a machine learning engineer came from the fact that models in Jupyter Notebooks have zero value. So, a role well-versed in machine learning and software engineering was needed to help bring the models “to life” and ensure they generate business value.

Because of this broad skillset, machine learning engineering is not an entry-level role, and you would typically need to be a data scientist or software engineer for a couple of years first.

So, to summarise:

Responsibilities: Train, build and deploy machine learning models.
Skills & Tech: Python, SQL, AWS, Bash/Zsh, PyTorch, Docker, Kubernetes, MLOps, Git, distributed computing (not an exhaustive list).
Experience: A couple of years as a data scientist or software engineer, and then up-skill yourself in the other areas.

If you want a better understanding of the different data and machine learning roles, I recommend checking out some of my previous articles.

The Difference Between ML Engineers and Data Scientists
Helping you decide whether you want to be a data scientist or machine learning engineermedium.com

Should You Become A Data Scientist, Data Analyst Or Data Engineer?
Explaining the differences and requirements between the various data rolesmedium.com

What do I do?

I work as a machine learning engineer within a cross-functional team. My squad specialises in classical machine learning and combinatorial optimisation-based problems.

Much of my work revolves around improving our machine learning models and optimisation solutions to improve the customer experience and generate financial value for the business.

The general workflow for most of my projects is as follows:

Idea — Someone may have an idea or hypothesis about how to improve one of our models.
Data — We check if the data to prove or disprove this hypothesis is readily available so we can start the research.
Research — If the data is available, we start building or testing this new hypothesis in the model.
Analysis — The results of the research stage are analysed to determine if we have improved the model.
Ship — The improvement is “productionised” in the codebase and goes live.

Along this process, there is a lot of interaction with other functions and roles within the team and broader company.

The idea phase is a collaborative discussion with a product manager who can provide business insight and any critical impacts we may have missed in the initial scoping.
Data, Build, and Analysis can be done in collaboration with data analysts and engineers to ensure the quality of our ETL pipelines and the use of the right data sources.
The research section would use the help of data scientists to use statistics and machine learning skills when looking to improve our model.
The ship phase is a joint effort with our dedicated software engineers, ensuring our deployment is robust and up to standard with best coding practices.

From experience, I know that this type of workflow is prevalent among machine learning engineers in numerous companies, although I am sure there are slight variations depending on where you are.

My job is also not just to write code day in and day out. I have other responsibilities, like conducting workshops, presenting to stakeholders, and mentoring more junior members.

What is the structure of machine learning teams?

Machine learning engineers work in many different ways across an organisation, but there are three distinct options, and the rest are a mix of them.

Embedded — In this case, machine learning engineers are embedded in cross-functional teams with analysts, product managers, software engineers and data scientists, where the team solves problems in one domain within the company. This is how I work, and I really like it because you get to pick up lots of valuable skills and abilities from other team members who are specialists in their own right.
Consultancy — This is the flip side, where machine learning engineers are part of an “in-house consultancy” and are part of their own team. In this scenario, the machine learning engineers work on problems based on their perceived value to the business. You are technically less specialised in this option as you may need to change the type of problems you work on.
Infrastructure/Platform — Instead of solving business problems directly, these machine learning engineers develop in-house tools and a deployment platform to make productionising the algorithms much easier.

All ways of working have pros and cons, and in reality, I wouldn’t say one is better than the other; it’s really a matter of personal preference. You still do exciting work, nonetheless!

What is a typical day in a life?

People online often glamourise working in tech, like it’s all coffee breaks, chats, and coding for an hour a day, and you make well over six figures.

This is definitely not the case, and I wish it was true, but it’s still a fun and enjoyable workday compared to many other professions.

My general experience has been:

9:00 am — 9:30 am. Start at 9 am with a morning standup to catch up with the team regarding the previous day’s work and what you are doing today. A “standup” meeting is very common across tech.
9:30 am — 10:30 am. After the standup, there may be another meeting for an hour, 9:30–10:30 or so, with stakeholders, engineers, an all-hands or other company meetings.
10:30 am — 13:00 pm. Then, it’s a work/code block for two hours or so where I focus on my projects. Depending on my work, I may pair with another data scientist, machine learning engineer or software engineer.
13:00 pm — 14:00 pm. Lunch.
14:00 pm — 17:45 pm. Afternoons are normally free of meetings, and there is a large block of focus time to work on your projects. This is mainly for individual contributors like myself.
17:45 pm — 18:00 pm. Reply to emails and Slack messages and wrap up for the day.

Every day is different, but this is what you can expect. As you can tell, it’s nothing “extrordinary.”

This is also the workday for a junior / mid-level individual contributor (IC) like myself. Senior positions, especially managerial roles, typically have more meetings.

An important thing to note is that I don’t always code in my work blocks. I may have a presentation to prepare for stakeholders, some ad-hoc analysis for our product manager, or some writing up of my latest research. I may not even code for the whole day!

On average, I spend 3–4 hours hard coding; the rest is meetings or ad-hoc work. Of course, this varies between companies and at different times of the year.

Why am I’m a machine learning engineer?

The reason I am a machine learning engineer can be boiled down to four main reasons:

Interesting. As a machine learning engineer, I get to be correct at the forefront of the latest tech trends like AI, LLMs, and pretty much anything that is going viral in the field. There is always something new and exciting to learn, which I love! So, if you want to constantly learn new skills and apply them, this may be a career you would be interested in.
Work-Life Balance. Tech jobs generally provide better work-life balance than other professions like banking, law or consulting. Most machine learning jobs are 9–6, and you can often spend a few days working from home. This flexibility allows me to pursue other passions, projects, and hobbies outside of work, such as this blog!
Compensation. It’s no secret that tech jobs provide some of the highest salaries. According to levelsfyi, the median salary of a machine learning engineer in the UK is £93k, which is crazy for an average value.
Range of Industries. As a machine learning engineer, you can work in loads of different industries during your career. However, to become a real specialist, you must find and stick to one industry you love.

I hope this article gave you more insight into machine learning, if you have any questions let me know in the comments.

Another thing!

Dishing The Data | Egor Howell | Substack
Advice and learnings on data science, tech and entrepreneurship. Click to read Dishing The Data, by Egor Howell, a…newsletter.egorhowell.com

Connect with me

The post What Do Machine Learning Engineers Do? appeared first on Towards Data Science.

This Is How LLMs Break Down the Language

Prashal Ruchiranga — Mon, 10 Mar 2025 14:01:07 +0000

Do you remember the hype when OpenAI released GPT-3 in 2020? Though not the first in its series, GPT-3 gained widespread popularity due to its impressive text generation capabilities. Since then, a diverse group of Large Language Models(Llms) have flooded the AI landscape. The golden question is: Have you ever wondered how ChatGPT or any other LLMs break down the language? If you haven’t yet, we are going to discuss the mechanism by which LLMs process the textual input given to them during training and inference. In principle, we call it tokenization.

This article is inspired by the YouTube video titled Deep Dive into LLMs like ChatGPT from former Senior Director of AI at Tesla, Andrej Karpathy. His general audience video series is highly recommended for those who want to take a deep dive into the intricacies behind LLMs.

Before diving into the main topic, I need you to have an understanding of the inner workings of a LLM. In the next section, I’ll break down the internals of a language model and its underlying architecture. If you’re already familiar with neural networks and LLMs in general, you can skip the next section without affecting your reading experience.

Internals of large language models

LLMs are made up of transformer neural networks. Consider neural networks as giant mathematical expressions. Inputs to neural networks are a sequence of tokens that are typically processed through embedding layers, which convert the tokens into numerical representations. For now, think of tokens as basic units of input data, such as words, phrases, or characters. In the next section, we’ll explore how to create tokens from input text data in depth. When we feed these inputs to the network, they are mixed into a giant mathematical expression along with the parameters or weights of these neural networks.

Modern neural networks have billions of parameters. At the beginning, these parameters or weights are set randomly. Therefore, the neural network randomly guesses its predictions. During the training process, we iteratively update these weights so that the outputs of our neural network become consistent with the patterns observed in our training set. In a sense, neural network training is about finding the right set of weights that seem to be consistent with the statistics of the training set.

The transformer architecture was introduced in the paper titled “Attention is All You Need” by Vaswani et al. in 2017. This is a neural network with a special kind of structure designed for sequence processing. Initially intended for Neural Machine Translation, it has since become the founding building block for LLMs.

To get a sense of what production grade transformer neural networks look like visit https://bbycroft.net/llm. This site provides interactive 3D visualizations of generative pre-trained transformer (GPT) architectures and guides you through their inference process.

Visualization of Nano-GPT at https://bbycroft.net/llm (Image by the author)

This particular architecture, called Nano-GPT, has around 85,584 parameters. We feed the inputs, which are token sequences, at the top of the network. Information then flows through the layers of the network, where the input undergoes a series of transformations, including attention mechanisms and feed-forward networks, to produce an output. The output is the model’s prediction for the next token in the sequence.

Tokenization

Training a state-of-the-art language model like ChatGPT or Claude involves several stages arranged sequentially. In my previous article about hallucinations, I briefly explained the training pipeline for an LLM. If you want to learn more about training stages and hallucinations, you can read it here.

Now, imagine we’re at the initial stage of training called pretraining. This stage requires a large, high-quality, web-scale dataset of terabyte size. The datasets used by major LLM providers are not publicly available. Therefore, we will look into an open-source dataset curated by Hugging Face, called FineWeb distributed under the Open Data Commons Attribution License. You can read more about how they collected and created this dataset here.

FineWeb dataset curated by Hugging Face (Image by the author)

I downloaded a sample from the FineWeb dataset, selected the first 100 examples, and concatenated them into a single text file. This is just raw internet text with various patterns within it.

Sampled text from the FineWeb dataset (Image by the author)

So our goal is to feed this data to the transformer neural network so that the model learns the flow of this text. We need to train our neural network to mimic the text. Before plugging this text into the neural network, we must decide how to represent it. Neural networks expect a one-dimensional sequence of symbols. That requires a finite set of possible symbols. Therefore, we must determine what these symbols are and how to represent our data as a one-dimensional sequence of them.

What we have at this point is a one-dimensional sequence of text. There is an underlined representation of a sequence of raw bits for this text. We can encode the original sequence of text with UTF-8 encoding to get the sequence of raw bits. If you check the image below, you can see that the first 8 bits of the raw bit sequence correspond to the first letter ‘A’ of the original one-dimensional text sequence.

Sampled text, represented as a one-dimensional sequence of bits (Image by the author)

Now, we have a very long sequence with two symbols: zero and one. This is, in fact, what we were looking for — a one-dimensional sequence of symbols with a finite set of possible symbols. Now the problem is that sequence length is a precious resource in a neural network primarily because of computational efficiency, memory constraints, and the difficulty of processing long dependencies. Therefore, we don’t want extremely long sequences of just two symbols. We prefer shorter sequences of more symbols. So, we are going to trade off the number of symbols in our vocabulary against the resulting sequence length.

As we need to further compress or shorten our sequence, we can group every 8 consecutive bits into a single byte. Since each bit is either 0 or 1, there are exactly 256 possible combinations of 8-bit sequences. Thus, we can represent this sequence as a sequence of bytes instead.

Grouping bits to bytes (Image by the author)

This representation reduces the length by a factor of 8, while expanding the symbol set to 256 possibilities. Consequently, each value in the sequence will fall within the range of 0 to 255.

Sampled text, represented as a one-dimensional sequence of bytes (Image by the author)

These numbers do not have any value in a numerical sense. They are just placeholders for unique identifiers or symbols. In fact, we could replace each of these numbers with a unique emoji and the core idea would still stand. Think of this as a sequence of emojis, each chosen from 256 unique options.

Sampled text, represented as a one-dimensional sequence of emojis (Image by the author)

This process of converting from raw text into symbols is called Tokenization. Tokenization in state-of-the-art language models goes even beyond this. We can further compress the length of the sequence in return for more symbols in our vocabulary using the Byte-Pair Encoding (BPE) algorithm. Initially developed for text compression, BPE is now widely used by transformer models for tokenization. OpenAI’s GPT series uses standard and customized versions of the BPE algorithm.

Essentially, byte pair encoding involves identifying frequent consecutive bytes or symbols. For example, we can look into our byte level sequence of text.

Sequence 101, followed by 114, is quite frequent (Image by the author)

As you can see, the sequence 101 followed by 114 appears frequently. Therefore, we can replace this pair with a new symbol and assign it a unique identifier. We are going to rewrite every occurrence of 101 114 using this new symbol. This process can be repeated multiple times, with each iteration further shortening the sequence length while introducing additional symbols, thereby increasing the vocabulary size. Using this process, GPT-4 has come up with a token vocabulary of around 100,000.

We can further explore tokenization using Tiktokenizer. Tiktokenizer provides an interactive web-based graphical user interface where you can input text and see how it’s tokenized according to different models. Play with this tool to get an intuitive understanding of what these tokens look like.

For example, we can take the first four sentences of the text sequence and input them into the Tiktokenizer. From the dropdown menu, select the GPT-4 base model encoder: cl100k_base.

Tiktokenizer (Image by the author)

The colored text shows how the chunks of text correspond to the symbols. The following text, which is a sequence of length 51, is what GPT-4 will see at the end of the day.

11787, 499, 21815, 369, 90250, 763, 14689, 30, 7694, 1555, 279, 21542, 3770, 323, 499, 1253, 1120, 1518, 701, 4832, 2457, 13, 9359, 1124, 323, 6642, 264, 3449, 709, 3010, 18396, 13, 1226, 617, 9214, 315, 1023, 3697, 430, 1120, 649, 10379, 83, 3868, 311, 3449, 18570, 1120, 1093, 499, 0

We can now take our entire sample dataset and re-represent it as a sequence of tokens using the GPT-4 base model tokenizer, cl100k_base. Note that the original FineWeb dataset consists of a 15-trillion-token sequence, while our sample dataset contains only a few thousand tokens from the original dataset.

Sampled text, represented as a one-dimensional sequence of tokens (Image by the author)

Conclusion

Tokenization is a fundamental step in how LLMs process text, transforming raw text data into a structured format before being fed into neural networks. As neural networks require a one-dimensional sequence of symbols, we need to achieve a balance between sequence length and the number of symbols in the vocabulary, optimizing for efficient computation. Modern state-of-the-art transformer-based LLMs, including GPT and GPT-2, use Byte-Pair Encoding tokenization.

Breaking down tokenization helps demystify how LLMs interpret text inputs and generate coherent responses. Having an intuitive sense of what tokenization looks like helps in understanding the internal mechanisms behind the training and inference of LLMs. As LLMs are increasingly used as a knowledge base, a well-designed tokenization strategy is crucial for improving model efficiency and overall performance.

If you enjoyed this article, connect with me on X (formerly Twitter) for more insights.

References

The post This Is How LLMs Break Down the Language appeared first on Towards Data Science.

Deep Learning | Towards Data Science

Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech

Training a Conversational Speech Model

But what does it really take to generate audio?

Preprocessing audio

What is Audio Quantization?

Vector Quantization

Residual Vector Quantization

Acoustic vs Semantic Codebooks

Audio Decoder

In Summary

References and Must-read papers

The Basis of Cognitive Complexity: Teaching CNNs to See Connections

Is it the same? Is it different?

Can CNNs learn same-different relationships?

Conclusions

Another thing!

Reference

Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o

Introduction

High-Level Overview: Autonomous Debates with Multiple Agents

Prerequisites and Setup

Under the Hood: State Management and Workflow Setup

Defining the Debate State Schema (debate_state.py)

Constants and Configuration

Building the LangGraph Workflow (debate_workflow.py)

Agent Nodes Breakdown

BaseComponent – A Reusable Agent Base Class

Topic Generator Agent (GenerateTopicNode)

Debater Agents (Pro and Con)

Fact Checker Agent (FactCheckNode)

Debate Moderator Agent (DebateModeratorNode)

Fact Check Router (FactCheckRouterNode)

Judge Agent (JudgeNode)

Langsmith Tracing

Reflections and Next Steps

References

The Case for Centralized AI Model Inference Serving

Toy Experiment

Step 1: Creating a TorchScript Model Checkpoint

Step 2: Model Inference Function

Step 3: Running Parallel Inference Jobs

Estimating the Maximum Number of Processes

The Inefficiencies of Independent Model Execution

TorchServe Setup

Installation

Creating a Model Archive

TorchServe Configuration

Starting TorchServe

Inference Request Implementation

Scaling Up the Number of Concurrent Jobs

Results

Next Steps

Batch Inference with TorchServe

Results

Multi-Worker Inference with TorchServe

Results

Summary

A Simple Implementation of the Attention Mechanism from Scratch

Introduction

Self-Attention in Transformers

K, V, Q representations

Multi-head Self-Attention in Transformers (MHSA)

Hands-on

Implementing MultiheadSelf-Attention

Final Thoughts

Understanding the Tech Stack Behind Generative AI

Understanding the Tech Stack Behind Generative AI

1 What makes generative AI work – at its core

Foundation Models

Multimodal models

GPU & Compute Providers

ML Frameworks & Libraries

AI Application Frameworks

Databases & Vector Stores

Programming Languages

2 Scaling AI: Infrastructure and Compute Power

Containers & Orchestration

Chip manufacturers & AI hardware

API Providers for Foundation Models

Defining the Debate State Schema (`debate_state.py`)

Building the LangGraph Workflow (`debate_workflow.py`)

`BaseComponent` – A Reusable Agent Base Class

Topic Generator Agent (`GenerateTopicNode`)

Fact Checker Agent (`FactCheckNode`)

Debate Moderator Agent (`DebateModeratorNode`)

Fact Check Router (`FactCheckRouterNode`)

Judge Agent (`JudgeNode`)