The Basis of Cognitive Complexity: Teaching CNNs to See Connections
Transforming CNNs: From task-specific learning to abstract generalization

Liberating education consists in acts of cognition, not transferrals of information.

Paulo Freire

One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing?

Many authors suggest that artificial intelligence models do not possess the same capabilities as humans, especially when it comes to plasticity, flexibility, and adaptation.

In particular, these models fail to capture many of the causal relationships that govern the external world.

This article discusses these issues:

  • The parallelism between convolutional neural networks (CNNs) and the human visual cortex
  • Limitations of CNNs in understanding causal relations and learning abstract concepts
  • How to make CNNs learn simple causal relations

Is it the same? Is it different?

Convolutional neural networks (CNNs) [2] are multi-layered neural networks that take images as input and can be used for multiple tasks. One of the most fascinating aspects of CNNs is their inspiration from the human visual cortex [1]:

  • Hierarchical processing. The visual cortex processes images hierarchically: early visual areas capture simple features (such as edges, lines, and colors), while deeper areas capture more complex features such as shapes, objects, and scenes. Thanks to its layered structure, a CNN behaves similarly, capturing edges and textures in its early layers while deeper layers capture object parts or whole objects.
  • Receptive fields. Neurons in the visual cortex respond to stimuli in a specific local region of the visual field (commonly called receptive fields). As we go deeper, the receptive fields of the neurons widen, allowing more spatial information to be integrated. Thanks to pooling steps, the same happens in CNNs.
  • Feature sharing. Although biological neurons are not identical, similar features are recognized across different parts of the visual field. In CNNs, the various filters scan the entire image, allowing patterns to be recognized regardless of location.
  • Spatial invariance. Humans can recognize objects even when they are moved, scaled, or rotated. CNNs also possess this property.
The relationship between components of the visual system and CNN. Image source: here

These features have made CNNs perform well in visual tasks to the point of superhuman performance:

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well-trained on the validation images to be better aware of the existence of relevant classes. […] Our result (4.94%) exceeds the reported human-level performance. —source [3]

Although CNNs perform better than humans in several tasks, there are still cases where they fail spectacularly. For example, in a 2024 study [4], AI models failed to generalize in image classification: state-of-the-art models outperform humans on objects in upright poses but fail when objects appear in unusual poses.

The correct label is shown above each object, and the model's incorrect prediction is shown below. Image source: here

In conclusion, our results show that (1) humans are still much more robust than most networks at recognizing objects in unusual poses, (2) time is of the essence for such ability to emerge, and (3) even time-limited humans are dissimilar to deep neural networks. —source [4]

The authors of [4] note that humans need time to succeed at the task: some tasks require not only visual recognition but also abstract cognition, which takes time.

The generalization abilities that make humans so capable come from understanding the laws that govern relations among objects. Humans recognize objects by extrapolating rules and chaining them together to adapt to new situations. One of the simplest rules is the "same-different" relation: the ability to decide whether two objects are the same or different. This ability develops rapidly during infancy and is closely associated with language development [5-7]. Some animals, such as ducks and chimpanzees, also have it [8]. In contrast, learning same-different relations is very difficult for neural networks [9-10].

Example of a same-different task for a CNN. The network should return a label of 1 if the two objects are the same or a label of 0 if they are different. Image source: here
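As a concrete (and entirely hypothetical) illustration of what one training sample could look like, the sketch below builds a single image containing two shapes and labels it 1 if the shapes are the same and 0 otherwise. The helper function, shapes, and canvas size are my own assumptions for illustration, not the dataset used in [11].

import torch

def make_sample(size: int = 32, same: bool = True):
    """Return a 1 x size x size image containing two 5x5 shapes and a 0/1 label."""
    canvas = torch.zeros(1, size, size)
    square = torch.ones(5, 5)        # shape A: a filled square
    cross = torch.zeros(5, 5)        # shape B: a plus sign
    cross[2, :] = 1.0
    cross[:, 2] = 1.0

    first = square
    second = square if same else cross

    # Place the two shapes at fixed, non-overlapping positions on the canvas.
    canvas[0, 4:9, 4:9] = first
    canvas[0, 20:25, 20:25] = second
    return canvas, int(same)         # label 1 = same, label 0 = different

image, label = make_sample(same=False)
print(image.shape, label)            # torch.Size([1, 32, 32]) 0

A classifier trained on such pairs has to encode the relation between the two shapes rather than memorize the shapes themselves, which is precisely what makes the task hard for standard CNNs.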

Convolutional networks show difficulty in learning this relationship. Likewise, they fail to learn other types of causal relationships that are simple for humans. Therefore, many researchers have concluded that CNNs lack the inductive bias necessary to be able to learn these relationships.

These negative results do not mean that neural networks are completely incapable of learning same-different relations. Much larger and longer trained models can learn this relation. For example, vision-transformer models pre-trained on ImageNet with contrastive learning can show this ability [12].

Can CNNs learn same-different relationships?

The fact that large models can learn these kinds of relationships has rekindled interest in CNNs. The same-different relationship is considered one of the basic logical operations that form the foundation of higher-order cognition and reasoning. Showing that shallow CNNs can learn this concept would allow us to experiment with other relationships and, moreover, would allow models to learn increasingly complex causal relationships. This is an important step in advancing the generalization capabilities of AI.

Previous work suggests that CNNs lack the architectural inductive biases needed to learn abstract visual relations. Other authors assume that the problem lies in the training paradigm. In general, classical gradient descent is used to learn a single task or a set of tasks. Given a task t or a set of tasks T, a loss function L is used to optimize the weights φ so as to minimize L:

Image source from here

This can be viewed as simply the sum of the losses across the different tasks (if we have more than one task). Instead, the Model-Agnostic Meta-Learning (MAML) algorithm [13] is designed to search for an optimal point in weight space for a set of related tasks. MAML seeks an initial set of weights θ from which a few gradient steps on any of the tasks yield a low loss, facilitating rapid adaptation:

Image source from here

The difference may seem small, but conceptually, this approach is directed toward abstraction and generalization. If there are multiple tasks, traditional training tries to optimize a single set of weights for all of them at once. MAML instead tries to identify an initialization that is close (roughly equidistant in weight space) to the optima of the different tasks. This starting point θ allows the model to adapt and generalize more effectively across tasks.
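In standard notation (a sketch of the usual formulations from the MAML paper [13] rather than a transcription of the figures above), the two objectives can be written as:

\phi^{*} = \arg\min_{\phi} \sum_{t \in T} \mathcal{L}_{t}(\phi) \qquad \text{(classical multi-task training)}

\theta^{*} = \arg\min_{\theta} \sum_{t \in T} \mathcal{L}_{t}\bigl(U_{t}(\theta)\bigr), \qquad U_{t}(\theta) = \theta - \alpha \nabla_{\theta}\mathcal{L}_{t}(\theta) \qquad \text{(MAML)}

where U_t(θ) denotes one (or a few) gradient steps of task-specific adaptation with inner learning rate α.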

Meta-learning initial weights for generalization. Image source from here
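To make the training procedure concrete, below is a minimal first-order sketch of a MAML-style meta-update in PyTorch. Everything here is an assumption for illustration: model is any nn.Module, tasks is a hypothetical iterable yielding one (support, query) batch pair per task with shapes and label types compatible with the model, and the learning rates are placeholders. The full MAML algorithm also differentiates through the inner updates, which this first-order version omits.

import copy
import torch
import torch.nn as nn

def maml_outer_step(model, meta_optimizer, tasks, inner_lr=0.01, inner_steps=1):
    """One first-order meta-update: adapt a copy of the model to each task,
    then update the shared initialization using the query-set gradients."""
    loss_fn = nn.BCEWithLogitsLoss()               # same/different is a binary task
    meta_optimizer.zero_grad()

    for (x_support, y_support), (x_query, y_query) in tasks:
        fast_model = copy.deepcopy(model)          # task-specific copy of the weights
        inner_opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)

        for _ in range(inner_steps):               # inner loop: adapt to the task
            inner_opt.zero_grad()
            loss_fn(fast_model(x_support), y_support).backward()
            inner_opt.step()

        # Outer loop: evaluate the adapted weights on the query set and
        # accumulate (first-order) gradients into the shared initialization.
        query_loss = loss_fn(fast_model(x_query), y_query)
        grads = torch.autograd.grad(query_loss, fast_model.parameters())
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g

    meta_optimizer.step()                          # update the shared initialization

A typical usage would create meta_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) once and then call maml_outer_step() repeatedly over batches of tasks; the full second-order variant would additionally backpropagate through the inner gradient steps.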

Since we now have a method biased toward generalization and abstraction, we can test whether we can make CNNs learn the same-different relationship.

In this study [11], the authors compared shallow CNNs trained with classic gradient descent against CNNs trained with meta-learning on a dataset designed for this purpose. The dataset consists of 10 different tasks that test for the same-different relationship.

The Same-Different dataset. Image source from here

The authors [11] compare CNNs of 2, 4, or 6 layers trained in a traditional way or with meta-learning, showing several interesting results:

  1. Traditionally trained CNNs perform close to random guessing.
  2. Meta-learning significantly improves performance, suggesting that the model can learn the same-different relationship. A 2-layer CNN performs only slightly better than chance, but as the depth of the network increases, performance approaches near-perfect accuracy.
Comparison between traditional training and meta-learning for CNNs. Image source from here

One of the most intriguing results of [11] is that the model can be trained in a leave-one-out fashion (training on 9 tasks and holding one out) and still show out-of-distribution generalization. Thus, the model has learned an abstract behavior that is rarely seen in such a small model (6 layers).

Out-of-distribution generalization for same-different classification. Image source from here

Conclusions

Although convolutional networks were inspired by how the human brain processes visual stimuli, they do not capture some of its basic capabilities. This is especially true when it comes to causal relations or abstract concepts. Some of these relationships can be learned only by large models with extensive training. This has led to the assumption that small CNNs cannot learn these relations due to a lack of architectural inductive bias. In recent years, efforts have been made to create new architectures that could have an advantage in learning relational reasoning, yet most of them still fail to learn these kinds of relationships. Intriguingly, this limitation can be overcome through the use of meta-learning.

The advantage of meta-learning is that it incentivizes more abstract learning. Meta-learning pushes the model toward generalization by trying to optimize for all tasks at the same time. To do this, learning more abstract features is favored (low-level features, such as the angles of a particular shape, are not useful for generalization and are disfavored). Meta-learning thus allows a shallow CNN to learn abstract behavior that would otherwise require many more parameters and much more training.

Shallow CNNs and the same-different relationship serve as a model system for studying higher cognitive functions. Meta-learning and other forms of training could be useful for improving the reasoning capabilities of models.

Another thing!

You can look for my other articles on Medium, and you can also connect with or reach me on LinkedIn or Bluesky. Check this repository, which contains weekly updated ML & AI news, or here for other tutorials and here for AI reviews. I am open to collaborations and projects, and you can reach me on LinkedIn.

Reference

Here is the list of the principal references I consulted while writing this article; only the first author of each paper is cited.

  1. Lindsay, 2020, Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future, link
  2. Li, 2020, A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects, link
  3. He, 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, link
  4. Ollikka, 2024, A comparison between humans and AI at recognizing objects in unusual poses, link
  5. Premack, 1981, The codes of man and beasts, link
  6. Blote, 1999, Young children’s organizational strategies on a same–different task: A microgenetic study and a training study, link
  7. Lupker, 2015, Is there phonologically based priming in the same-different task? Evidence from Japanese-English bilinguals, link
  8. Gentner, 2021, Learning same and different relations: cross-species comparisons, link
  9. Kim, 2018, Not-so-clevr: learning same–different relations strains feedforward neural networks, link
  10. Puebla, 2021, Can deep convolutional neural networks support relational reasoning in the same-different task? link
  11. Gupta, 2025, Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation, link
  12. Tartaglini, 2023, Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations, link
  13. Finn, 2017, Model-agnostic meta-learning for fast adaptation of deep networks, link

The Art of Noise
Understanding and implementing a diffusion model from scratch with PyTorch

Introduction

In my last several articles I talked about generative deep learning algorithms, which were mostly related to text generation tasks. So, I think it would be interesting to switch to generative algorithms for image generation now. Nowadays there are plenty of deep learning models specialized for generating images, such as the Autoencoder, the Variational Autoencoder (VAE), the Generative Adversarial Network (GAN), and Neural Style Transfer (NST). I have written about some of these topics on Medium as well; I provide the links at the end of this article if you want to read them.

In today's article, I would like to discuss the so-called diffusion model, one of the most impactful models in the field of deep learning for image generation. The idea of this algorithm was first proposed in the paper titled Deep Unsupervised Learning using Nonequilibrium Thermodynamics, written by Sohl-Dickstein et al. back in 2015 [1]. Their framework was then developed further by Ho et al. in 2020 in their paper Denoising Diffusion Probabilistic Models (DDPM) [2]. DDPM was later adapted by OpenAI and Google to develop DALL-E 2 and Imagen, models known for their impressive ability to generate high-quality images.

How Diffusion Models Work

Generally speaking, a diffusion model works by generating an image from noise. We can think of it like an artist transforming a splash of paint on a canvas into a beautiful artwork. In order to do so, the diffusion model needs to be trained first. Training involves two main processes, namely forward diffusion and backward diffusion.

Figure 1. The forward and backward diffusion process [3].

As you can see in the above figure, forward diffusion is a process in which Gaussian noise is applied to the original image iteratively. We keep adding noise until the image is completely unrecognizable, at which point we can say that the image lies in the latent space. Unlike Autoencoders and GANs, where the latent space typically has a lower dimension than the original image, the latent space in DDPM has exactly the same dimensionality as the original image. This noising process follows the principle of a Markov chain, meaning that the image at timestep t depends only on the image at timestep t-1. Forward diffusion is considered easy since all we basically do is add some noise step by step.

The second phase is called backward diffusion, where our objective is to remove the noise little by little until we obtain a clear image. This process follows the principle of the reverse Markov chain, where the image at timestep t-1 can only be obtained based on the image at timestep t. Such a denoising process is really difficult since we need to guess which pixels are noise and which ones belong to the actual image content. Thus, we need to employ a neural network model to do so.

DDPM uses U-Net as the basis of the deep learning architecture for backward diffusion. However, instead of using the original U-Net model [4], we need to make several modifications to it so that it will be more suitable for our task. Later on, I am going to train this model on the MNIST Handwritten Digit dataset [5], and we will see whether it can generate similar images.

Well, that was pretty much all the fundamental concepts you need to know about diffusion models for now. In the next sections we are going to get even deeper into the details while implementing the algorithm from scratch.


PyTorch Implementation

We are going to start by importing the required modules. In case you’re not yet familiar with the imports below, both torch and torchvision are the libraries we’ll use for preparing the model and the dataset. Meanwhile, matplotlib and tqdm will help us display images and progress bars.

# Codeblock 1
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm

As the modules have been imported, the next thing to do is to initialize some config parameters. Look at the Codeblock 2 below for the details.

# Codeblock 2
IMAGE_SIZE     = 28     #(1)
NUM_CHANNELS   = 1      #(2)

BATCH_SIZE     = 2
NUM_EPOCHS     = 10
LEARNING_RATE  = 0.001

NUM_TIMESTEPS  = 1000   #(3)
BETA_START     = 0.0001 #(4)
BETA_END       = 0.02   #(5)
TIME_EMBED_DIM = 32     #(6)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")  #(7)
DEVICE

# Codeblock 2 Output
device(type='cuda')

At the lines marked with #(1) and #(2) I set IMAGE_SIZE and NUM_CHANNELS to 28 and 1, numbers that come from the image dimensions of the MNIST dataset. The BATCH_SIZE, NUM_EPOCHS, and LEARNING_RATE variables are pretty straightforward, so I don't think I need to explain them further.

At line #(3), the variable NUM_TIMESTEPS denotes the number of iterations in the forward and backward diffusion process. Timestep 0 is the condition where the image is in its original state (the leftmost image in Figure 1). In this case, since we set this parameter to 1000, timestep number 999 is going to be the condition where the image is completely unrecognizable (the rightmost image in Figure 1). It is important to keep in mind that the choice of the number of timesteps involves a tradeoff between model accuracy and computational cost. If we assign a small value for NUM_TIMESTEPS, the inference time is going to be shorter, yet the resulting image might not be really good since the model has fewer steps to refine the image in the backward diffusion stage. On the other hand, increasing NUM_TIMESTEPS will slow down the inference process, but we can expect the output image to have better quality thanks to the gradual denoising process which results in a more precise reconstruction.

Next, the BETA_START (#(4)) and BETA_END (#(5)) variables are used to control the amount of Gaussian noise added at each timestep, whereas TIME_EMBED_DIM (#(6)) determines the length of the feature vector used to store the timestep information. Lastly, at line #(7) I assign "cuda" to the DEVICE variable if PyTorch detects a GPU installed in our machine. I highly recommend you run this project on a GPU since training a diffusion model is computationally expensive. Note that the values set for NUM_TIMESTEPS, BETA_START, and BETA_END are all adopted directly from the DDPM paper [2].

The complete implementation will be done in several steps: constructing the U-Net model, preparing the dataset, defining noise scheduler for the diffusion process, training, and inference. We are going to discuss each of those stages in the following sub-sections.


The U-Net Architecture: Time Embedding

As I've mentioned earlier, the basis of a diffusion model is U-Net. This architecture is used because its output layer is suitable for representing an image, which makes sense since it was originally introduced for image segmentation in the first place. The following figure shows what the original U-Net architecture looks like.

Figure 2. The original U-Net model proposed in [4].

However, it is necessary to modify this architecture so that it can also take the timestep information into account. Not only that, since we will only use the MNIST dataset, we also need to make the model smaller. Just remember the convention in deep learning that simpler models are often more effective for simple tasks.

In the figure below I show you the entire modified U-Net model. Here you can see that the time embedding tensor is injected into the model at every stage, which will later be done by element-wise summation, allowing the model to capture the timestep information. Next, instead of repeating each of the downsampling and upsampling stages four times like the original U-Net, in this case we will only repeat each of them twice. Additionally, it is worth noting that the stack of downsampling stages is also known as the encoder, whereas the stack of upsampling stages is often called the decoder.

Figure 3. The modified U-Net model for our diffusion task [3].

Now let's start constructing the architecture by creating a class for generating the time embedding tensor, the idea of which is similar to the positional embedding in the Transformer. See Codeblock 3 below for the details.

# Codeblock 3
class TimeEmbedding(nn.Module):
    def forward(self):
        time = torch.arange(NUM_TIMESTEPS, device=DEVICE).reshape(NUM_TIMESTEPS, 1)
        print(f"time\t\t: {time.shape}")
          
        i = torch.arange(0, TIME_EMBED_DIM, 2, device=DEVICE)
        denominator = torch.pow(10000, i/TIME_EMBED_DIM)
        print(f"denominator\t: {denominator.shape}")
          
        even_time_embed = torch.sin(time/denominator)  #(1)
        odd_time_embed  = torch.cos(time/denominator)  #(2)
        print(f"even_time_embed\t: {even_time_embed.shape}")
        print(f"odd_time_embed\t: {odd_time_embed.shape}")
          
        stacked = torch.stack([even_time_embed, odd_time_embed], dim=2)  #(3)
        print(f"stacked\t\t: {stacked.shape}")
        time_embed = torch.flatten(stacked, start_dim=1, end_dim=2)  #(4)
        print(f"time_embed\t: {time_embed.shape}")
          
        return time_embed

What we basically do in the above code is create a tensor of size NUM_TIMESTEPS × TIME_EMBED_DIM (1000×32), where every single row of this tensor will contain the timestep information. Later on, each of the 1000 timesteps will be represented by a feature vector of length 32. The values in the tensor themselves are obtained based on the two equations in Figure 4. In Codeblock 3 above, these two equations are implemented at lines #(1) and #(2), each forming a tensor of size 1000×16. Next, these tensors are combined using the code at lines #(3) and #(4).

Here I also print out every single step done in the above codeblock so that you can get a better understanding of what is actually being done in the TimeEmbedding class. If you still want more explanation about the above code, feel free to read my previous post about the Transformer, which you can access through the link at the end of this article. Once you click the link, just scroll all the way down to the Positional Encoding section.

Figure 4. The sinusoidal positional encoding formula from the Transformer paper [6].
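For reference, the sinusoidal formulas implemented above (with the timestep t in place of the token position, i the feature-pair index, and d = TIME_EMBED_DIM) are:

PE_{(t,\,2i)} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad PE_{(t,\,2i+1)} = \cos\!\left(\frac{t}{10000^{2i/d}}\right)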

Now let’s check if the TimeEmbedding class works properly using the following testing code. The resulting output shows that it successfully produced a tensor of size 1000×32, which is exactly what we expected earlier.

# Codeblock 4
time_embed_test = TimeEmbedding()
out_test = time_embed_test()

# Codeblock 4 Output
time            : torch.Size([1000, 1])
denominator     : torch.Size([16])
even_time_embed : torch.Size([1000, 16])
odd_time_embed  : torch.Size([1000, 16])
stacked         : torch.Size([1000, 16, 2])
time_embed      : torch.Size([1000, 32])

The U-Net Architecture: DoubleConv

If you take a closer look at the modified architecture, you will see that we actually have lots of repeating patterns, such as the ones highlighted in the yellow boxes in the following figure.

Figure 5. The processes done inside the yellow boxes will be implemented in the DoubleConv class [3].

These five yellow boxes share the same structure, where they consist of two convolution layers with the time embedding tensor injected right after the first convolution operation is performed. So, what we are going to do now is to create another class named DoubleConv to reproduce this structure. Look at the Codeblock 5a and 5b below to see how I do that.

# Codeblock 5a
class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()
        
        self.conv_0 = nn.Conv2d(in_channels=in_channels,  #(2)
                                out_channels=out_channels, 
                                kernel_size=3, 
                                bias=False, 
                                padding=1)
        self.bn_0 = nn.BatchNorm2d(num_features=out_channels)  #(3)
        
        self.time_embedding = TimeEmbedding()  #(4)
        self.linear = nn.Linear(in_features=TIME_EMBED_DIM,  #(5)
                                out_features=out_channels)
        
        self.conv_1 = nn.Conv2d(in_channels=out_channels,  #(6)
                                out_channels=out_channels, 
                                kernel_size=3, 
                                bias=False, 
                                padding=1)
        self.bn_1 = nn.BatchNorm2d(num_features=out_channels)  #(7)
        
        self.relu = nn.ReLU(inplace=True)  #(8)

The two inputs of the __init__() method above give us the flexibility to configure the number of input and output channels (#(1)), so that the DoubleConv class can be used to instantiate all five yellow boxes simply by adjusting its input arguments. As the name suggests, here we initialize two convolution layers (lines #(2) and #(6)), each followed by a batch normalization layer and a ReLU activation function. Keep in mind that the two normalization layers need to be initialized separately (lines #(3) and #(7)) since each of them has its own trainable normalization parameters. Meanwhile, the ReLU activation function only needs to be initialized once (#(8)) because it contains no parameters, allowing it to be used multiple times in different parts of the network. At line #(4), we initialize the TimeEmbedding layer we created earlier, which will later be connected to a standard linear layer (#(5)). This linear layer is responsible for adjusting the dimension of the time embedding tensor so that the resulting output can be summed with the output of the first convolution layer in an element-wise manner.

Now let's take a look at Codeblock 5b below to better understand the flow of the DoubleConv block. Here you can see that the forward() method accepts two inputs: the raw image x and the timestep information t, as shown at line #(1). We initially process the image with the first Conv-BN-ReLU sequence (#(2–4)). This Conv-BN-ReLU structure is typically used when working with CNN-based models, even if the illustration does not explicitly show the batch normalization and ReLU layers. Apart from the image, we then take the embedding row corresponding to the t-th timestep of each image (#(5)) and pass it through the linear layer (#(6)). We still need to expand the dimensions of the resulting tensor using the code at line #(7) before performing element-wise summation at line #(8). Finally, we process the resulting tensor with the second Conv-BN-ReLU sequence (#(9–11)).

# Codeblock 5b
    def forward(self, x, t):  #(1)
        print(f'images\t\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        
        x = self.conv_0(x)  #(2)
        x = self.bn_0(x)    #(3)
        x = self.relu(x)    #(4)
        print(f'\nafter first conv\t: {x.size()}')
        
        time_embed = self.time_embedding()[t]      #(5)
        print(f'\ntime_embed\t\t: {time_embed.size()}')
        
        time_embed = self.linear(time_embed)       #(6)
        print(f'time_embed after linear\t: {time_embed.size()}')
        
        time_embed = time_embed[:, :, None, None]  #(7)
        print(f'time_embed expanded\t: {time_embed.size()}')
        
        x = x + time_embed  #(8)
        print(f'\nafter summation\t\t: {x.size()}')
        
        x = self.conv_1(x)  #(9)
        x = self.bn_1(x)    #(10)
        x = self.relu(x)    #(11)
        print(f'after second conv\t: {x.size()}')
        
        return x

To see if our DoubleConv implementation works properly, we are going to test it with Codeblock 6 below. Here I want to simulate the very first instance of this block, which corresponds to the leftmost yellow box in Figure 5. To do so, we need to set the in_channels and out_channels parameters to 1 and 64, respectively (#(1)). Next, we initialize two input tensors, namely x_test and t_test. The x_test tensor has the size of 2×1×28×28, representing a batch of two grayscale images of size 28×28 (#(2)). Keep in mind that this is just a dummy tensor of random values which will be replaced with actual images from the MNIST dataset later in the training phase. Meanwhile, t_test is a tensor containing the timestep numbers of the corresponding images (#(3)). The values for this tensor are randomly selected between 0 and NUM_TIMESTEPS (1000). Note that the datatype of this tensor must be an integer since the numbers will be used for indexing, as shown at line #(5) back in Codeblock 5b. Lastly, at line #(4) we pass both the x_test and t_test tensors to the double_conv_test layer.

By the way, I re-run the previous codeblocks with the print() functions removed prior to running the following code so that the outputs will look neater.

# Codeblock 6
double_conv_test = DoubleConv(in_channels=1, out_channels=64).to(DEVICE)  #(1)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)  #(2)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)  #(3)

out_test = double_conv_test(x_test, t_test)  #(4)

# Codeblock 6 Output
images                  : torch.Size([2, 1, 28, 28])   #(1)
timesteps               : torch.Size([2]), tensor([468, 304], device='cuda:0')  #(2)

after first conv        : torch.Size([2, 64, 28, 28])  #(3)

time_embed              : torch.Size([2, 32])          #(4)
time_embed after linear : torch.Size([2, 64])
time_embed expanded     : torch.Size([2, 64, 1, 1])    #(5)

after summation         : torch.Size([2, 64, 28, 28])  #(6)
after second conv       : torch.Size([2, 64, 28, 28])  #(7)

The shape of our original input tensors can be seen at lines #(1) and #(2) in the above output. Specifically at line #(2), I also print out the two timesteps that we selected randomly. In this example, we assume that each of the two images in the x_test tensor has already been noised with the noise levels of the 468th and 304th timesteps prior to being fed into the network. We can see that the shape of the image tensor changes to 2×64×28×28 after being passed through the first convolution layer (#(3)). Meanwhile, the size of our time embedding tensor becomes 2×32 (#(4)), which is obtained by extracting rows 468 and 304 from the original embedding of size 1000×32. In order to allow element-wise summation to be performed (#(6)), we need to map the 32-dimensional time embedding vectors to 64 dimensions and expand their axes, resulting in a tensor of size 2×64×1×1 (#(5)) so that it can be broadcast to the 2×64×28×28 tensor. After the summation is done, we pass the tensor through the second convolution layer, at which point the tensor dimensions do not change at all (#(7)).


The U-Net Architecture: Encoder

As we have successfully implemented the DoubleConv block, the next step to do is to implement the so-called DownSample block. In Figure 6 below, this corresponds to the parts enclosed in the red box.

Figure 6. The parts of the network highlighted in red are the so-called DownSample blocks [3].

The purpose of a DownSample block is to reduce the spatial dimension of an image, but it is important to note that it increases the number of channels at the same time. In order to achieve this, we can simply stack a DoubleConv block and a max-pooling operation. In this case the pooling uses a 2×2 kernel with a stride of 2, making the spatial dimension of the output half that of the input. The implementation of this block can be seen in Codeblock 7 below.

# Codeblock 7
class DownSample(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()
        
        self.double_conv = DoubleConv(in_channels=in_channels,  #(2)
                                      out_channels=out_channels)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)    #(3)
    
    def forward(self, x, t):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        
        convolved = self.double_conv(x, t)   #(5)
        print(f'\nafter double conv\t: {convolved.size()}')
        
        maxpooled = self.maxpool(convolved)  #(6)
        print(f'after pooling\t\t: {maxpooled.size()}')
        
        return convolved, maxpooled          #(7)

Here I set the __init__() method to take the number of input and output channels so that we can use it to create the two DownSample blocks highlighted in Figure 6 without needing to write them in separate classes (#(1)). Next, the DoubleConv and max-pooling layers are initialized at lines #(2) and #(3), respectively. Remember that since the DoubleConv block accepts the image x and the corresponding timestep t as inputs, we also need to set the forward() method of this DownSample block so that it accepts both of them as well (#(4)). The information contained in x and t is then combined as the two tensors are processed by the double_conv layer, and the output is stored in a variable named convolved (#(5)). Afterwards, we perform the actual downsampling with the max-pooling operation at line #(6), producing a tensor named maxpooled. It is important to note that both the convolved and maxpooled tensors are returned, which is essentially done because we will later bring maxpooled to the next downsampling stage, whereas the convolved tensor will be transferred directly to the upsampling stage in the decoder through skip-connections.

Now let’s test the DownSample class using the Codeblock 8 below. The input tensors used here are exactly the same as the ones in Codeblock 6. Based on the resulting output, we can see that the pooling operation successfully converted the output of the DoubleConv block from 2×64×28×28 (#(1)) to 2×64×14×14 (#(2)), indicating that our DownSample class works properly.

# Codeblock 8
down_sample_test = DownSample(in_channels=1, out_channels=64).to(DEVICE)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)

out_test = down_sample_test(x_test, t_test)

# Codeblock 8 Output
original          : torch.Size([2, 1, 28, 28])
timesteps         : torch.Size([2]), tensor([468, 304], device='cuda:0')

after double conv : torch.Size([2, 64, 28, 28])  #(1)
after pooling     : torch.Size([2, 64, 14, 14])  #(2)

The U-Net Architecture: Decoder

We need to introduce the so-called UpSample block in the decoder, which is responsible for reverting the tensor in the intermediate layers to the original image dimension. In order to maintain a symmetrical structure, the number of UpSample blocks must match that of the DownSample blocks. Look at the Figure 7 below to see where the two UpSample blocks are placed.

Figure 7. The components inside the blue boxes are the so-called UpSample blocks [3].

Since both UpSample blocks are structurally identical, we can just initialize a single class for them, just like the DownSample class we created earlier. Look at the Codeblock 9 below to see how I implement it.

# Codeblock 9
class UpSample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        
        self.conv_transpose = nn.ConvTranspose2d(in_channels=in_channels,  #(1)
                                                 out_channels=out_channels, 
                                                 kernel_size=2, stride=2)  #(2)
        self.double_conv = DoubleConv(in_channels=in_channels,  #(3)
                                      out_channels=out_channels)
        
    def forward(self, x, t, connection):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        print(f'connection\t\t: {connection.size()}')
        
        x = self.conv_transpose(x)  #(5)
        print(f'\nafter conv transpose\t: {x.size()}')
        
        x = torch.cat([x, connection], dim=1)  #(6)
        print(f'after concat\t\t: {x.size()}')
        
        x = self.double_conv(x, t)  #(7)
        print(f'after double conv\t: {x.size()}')
        
        return x

In the __init__() method, we use nn.ConvTranspose2d to upsample the spatial dimension (#(1)). Both the kernel size and stride are set to 2 so that the output will be twice as large (#(2)). Next, the DoubleConv block will be employed to reduce the number of channels, while at the same time combining the timestep information from the time embedding tensor (#(3)).

The flow of this UpSample class is a bit more complicated than that of the DownSample class. If we take a closer look at the architecture, we'll see that we also have a skip-connection coming directly from the encoder. Thus, we need the forward() method to accept another argument in addition to the original image x and the timestep t, namely the residual tensor connection (#(4)). The first thing we do inside this method is to process the original image x with the transpose convolution layer (#(5)). In fact, this layer not only upsamples the spatial size but also reduces the number of channels at the same time. However, the resulting tensor is then directly concatenated with connection in a channel-wise manner (#(6)), making it seem as if no channel reduction was performed. It is important to know that at this point the two tensors are just concatenated, meaning that the information from the two is not yet combined. We finally feed these concatenated tensors to the double_conv layer (#(7)), allowing them to share information with each other through the learnable parameters inside the convolution layers.

The Codeblock 10 below shows how I test the UpSample class. The size of the tensors to be passed through are set according to the second upsampling block, i.e., the rightmost blue box in Figure 7.

# Codeblock 10
up_sample_test = UpSample(in_channels=128, out_channels=64).to(DEVICE)

x_test = torch.randn((BATCH_SIZE, 128, 14, 14)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)
connection_test = torch.randn((BATCH_SIZE, 64, 28, 28)).to(DEVICE)

out_test = up_sample_test(x_test, t_test, connection_test)

In the resulting output below, if we compare the input tensor (#(1)) with the final tensor shape (#(2)), we can clearly see that the number of channels was successfully reduced from 128 to 64, while at the same time the spatial dimension increased from 14×14 to 28×28. This essentially means that our UpSample class is now ready to be used in the main U-Net architecture.

# Codeblock 10 Output
original             : torch.Size([2, 128, 14, 14])   #(1)
timesteps            : torch.Size([2]), tensor([468, 304], device='cuda:0')
connection           : torch.Size([2, 64, 28, 28])

after conv transpose : torch.Size([2, 64, 28, 28])
after concat         : torch.Size([2, 128, 28, 28])
after double conv    : torch.Size([2, 64, 28, 28])    #(2)

The U-Net Architecture: Putting All Components Together

Once all U-Net components have been created, what we are going to do next is to wrap them together into a single class. Look at the Codeblock 11a and 11b below for the details.

# Codeblock 11a
class UNet(nn.Module):
    def __init__(self):
        super().__init__()
      
        self.downsample_0 = DownSample(in_channels=NUM_CHANNELS,  #(1)
                                       out_channels=64)
        self.downsample_1 = DownSample(in_channels=64,            #(2)
                                       out_channels=128)
      
        self.bottleneck   = DoubleConv(in_channels=128,           #(3)
                                       out_channels=256)
      
        self.upsample_0   = UpSample(in_channels=256,             #(4)
                                     out_channels=128)
        self.upsample_1   = UpSample(in_channels=128,             #(5)
                                     out_channels=64)
      
        self.output = nn.Conv2d(in_channels=64,                   #(6)
                                out_channels=NUM_CHANNELS,
                                kernel_size=1)

You can see in the __init__() method above that we initialize two downsampling (#(1–2)) and two upsampling (#(4–5)) blocks, whose numbers of input and output channels are set according to the architecture shown in the illustration. There are actually two additional components I haven't explained yet, namely the bottleneck (#(3)) and the output layer (#(6)). The former is essentially just a DoubleConv block, which acts as the main connection between the encoder and the decoder. Look at Figure 8 below to see which components of the network belong to the bottleneck layer. Next, the output layer is a standard convolution layer responsible for turning the 64-channel image produced by the last upsampling stage into a single-channel one. This operation is done using a kernel of size 1×1, meaning that it combines information across all channels while operating independently at each pixel position.

Figure 8. The bottleneck layer (the lower part of the model) acts as the main bridge between the encoder and the decoder of U-Net [3].

I guess the forward() method of the entire U-Net in the following codeblock is pretty straightforward, as what we essentially do here is pass the tensors from one layer to another — just don’t forget to include the skip connections between the downsampling and upsampling blocks.

# Codeblock 11b
    def forward(self, x, t):  #(1)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
            
        convolved_0, maxpooled_0 = self.downsample_0(x, t)
        print(f'\nmaxpooled_0\t\t: {maxpooled_0.size()}')
            
        convolved_1, maxpooled_1 = self.downsample_1(maxpooled_0, t)
        print(f'maxpooled_1\t\t: {maxpooled_1.size()}')
            
        x = self.bottleneck(maxpooled_1, t)
        print(f'after bottleneck\t: {x.size()}')
    
        upsampled_0 = self.upsample_0(x, t, convolved_1)
        print(f'upsampled_0\t\t: {upsampled_0.size()}')
            
        upsampled_1 = self.upsample_1(upsampled_0, t, convolved_0)
        print(f'upsampled_1\t\t: {upsampled_1.size()}')
            
        x = self.output(upsampled_1)
        print(f'final output\t\t: {x.size()}')
            
        return x

Now let’s see whether we have correctly constructed the U-Net class above by running the following testing code.

# Codeblock 12
unet_test = UNet().to(DEVICE)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)

out_test = unet_test(x_test, t_test)

# Codeblock 12 Output
original         : torch.Size([2, 1, 28, 28])   #(1)
timesteps        : torch.Size([2]), tensor([468, 304], device='cuda:0')

maxpooled_0      : torch.Size([2, 64, 14, 14])  #(2)
maxpooled_1      : torch.Size([2, 128, 7, 7])   #(3)
after bottleneck : torch.Size([2, 256, 7, 7])   #(4)
upsampled_0      : torch.Size([2, 128, 14, 14])
upsampled_1      : torch.Size([2, 64, 28, 28])
final output     : torch.Size([2, 1, 28, 28])   #(5)

We can see in the above output that the two downsampling stages successfully converted the original tensor of size 1×28×28 (#(1)) into 64×14×14 (#(2)) and 128×7×7 (#(3)), respectively. This tensor is then passed through the bottleneck layer, causing its number of channels to expand to 256 without changing the spatial dimension (#(4)). Lastly, we upsample the tensor twice before eventually shrinking the number of channels to 1 (#(5)). Based on this output, it looks like our model is working properly. Thus, it is now ready to be trained for our diffusion task.


Dataset Preparation

As we have successfully created the entire U-Net architecture, the next thing to do is to prepare the MNIST Handwritten Digit dataset. Before actually loading it, we need to define the preprocessing steps using the transforms.Compose() method from Torchvision, as shown at line #(1) in Codeblock 13. There are two things we do here: converting the images into PyTorch tensors, which also scales the pixel values from 0–255 to 0–1 (#(2)), and normalizing them so that the final pixel values range between -1 and 1 (#(3)). Next, we download the dataset using datasets.MNIST() (#(4)). In this case, we are going to take the images from the training data, hence we use train=True (#(5)). Don't forget to pass the transform variable we initialized earlier to the transform parameter (transform=transform) so that the images are automatically preprocessed as we load them (#(6)). Lastly, we need to employ DataLoader to load the images from mnist_dataset (#(7)). The arguments I use for the input parameters are intended to randomly pick BATCH_SIZE (2) images from the dataset in each iteration.

# Codeblock 13
transform = transforms.Compose([  #(1)
    transforms.ToTensor(),        #(2)
    transforms.Normalize((0.5,), (0.5,))  #(3)
])

mnist_dataset = datasets.MNIST(   #(4)
    root='./data', 
    train=True,           #(5)
    download=True, 
    transform=transform   #(6)
)

loader = DataLoader(mnist_dataset,  #(7)
                    batch_size=BATCH_SIZE,
                    drop_last=True, 
                    shuffle=True)

In the following codeblock, I try to load a batch of images from the dataset. In every iteration, loader provides both the images and the corresponding labels, hence we need to store them in two separate variables: images and labels.

# Codeblock 14
images, labels = next(iter(loader))

print('images\t\t:', images.shape)
print('labels\t\t:', labels.shape)
print('min value\t:', images.min())
print('max value\t:', images.max())

We can see in the resulting output below that the images tensor has the size of 2×1×28×28 (#(1)), indicating that two grayscale images of size 28×28 have been successfully loaded. Here we can also see that the length of the labels tensor is 2, which matches the number of loaded images (#(2)). Note that in this case the labels are going to be completely ignored. My plan here is that I just want the model to generate any digit it has previously seen in the training dataset, without even knowing what digit it actually is. Lastly, this output also shows that the preprocessing works properly, as the pixel values now range between -1 and 1.

# Codeblock 14 Output
images    : torch.Size([2, 1, 28, 28])  #(1)
labels    : torch.Size([2])             #(2)
min value : tensor(-1.)
max value : tensor(1.)

Run the following code if you want to see what the image we just loaded looks like.

# Codeblock 15   
plt.imshow(images[0].squeeze(), cmap='gray')
plt.show()
Figure 9. Output from Codeblock 15 [3].

Noise Scheduler

In this section we are going to talk about how the forward and backward diffusion are performed, a process that essentially involves adding or removing noise little by little at each timestep. It is necessary to know that we want the amount of noise to be distributed evenly across all timesteps: in the forward diffusion the image should be completely full of noise exactly at timestep 1000, while in the backward diffusion we have to get a completely clear image back at timestep 0. Hence, we need something to control the noise amount for each timestep. Later in this section, I am going to implement a class named NoiseScheduler to do so. This will probably be the most mathy section of this article, as I'll display many equations here. But don't worry, since we'll focus on implementing these equations rather than discussing their mathematical derivations.

Now let’s take a look at the equations in Figure 10 which I will implement in the __init__() method of the NoiseScheduler class below.

Figure 10. The equations we need to implement in the __init__() method of the NoiseScheduler class [3].
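As a reference point (a sketch in standard DDPM notation, consistent with the code below rather than a transcription of the figure), the quantities computed in __init__() are:

\beta_t \in [\beta_{\text{start}}, \beta_{\text{end}}], \qquad \alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \sqrt{\bar{\alpha}_t}, \qquad \sqrt{1 - \bar{\alpha}_t}
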
# Codeblock 16a
class NoiseScheduler:
    def __init__(self):
        self.betas = torch.linspace(BETA_START, BETA_END, NUM_TIMESTEPS)  #(1)
        self.alphas = 1. - self.betas
        self.alphas_cum_prod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cum_prod = torch.sqrt(self.alphas_cum_prod)
        self.sqrt_one_minus_alphas_cum_prod = torch.sqrt(1. - self.alphas_cum_prod)

The above code works by creating multiple sequences of numbers, all of which are controlled by BETA_START (0.0001), BETA_END (0.02), and NUM_TIMESTEPS (1000). The first sequence we need to instantiate is betas itself, which is done using torch.linspace() (#(1)). It essentially generates a 1-dimensional tensor of length 1000, ranging from 0.0001 to 0.02, where every single element corresponds to a single timestep. The interval between elements is uniform, allowing us to spread the noise evenly across all timesteps as well. With this betas tensor, we then compute alphas, alphas_cum_prod, sqrt_alphas_cum_prod and sqrt_one_minus_alphas_cum_prod based on the four equations in Figure 10. Later on, these tensors will act as the basis of how the noise is generated or removed during the diffusion process.

Diffusion is normally done in a sequential manner. However, since every forward step simply adds Gaussian noise, the forward process can be written in closed form, so we can obtain the noisy image at a specific timestep without having to iteratively add noise from the very beginning. Figure 11 below shows what the closed form of the forward diffusion looks like, where x₀ represents the original image while epsilon (ϵ) denotes an image made up of random Gaussian noise. We can think of this equation as a weighted combination, where we combine the clear image and the noise according to weights determined by the timestep, resulting in an image with a specific amount of noise.

Figure 11. The closed form of the forward diffusion process [3].
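In standard DDPM notation, this closed form (which the forward_diffusion() method below implements) reads:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)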

The implementation of this equation can be seen in Codeblock 16b. In this forward_diffusion() method, x₀ and ϵ are denoted as original and noise. Here you need to keep in mind that these two input variables are images, whereas sqrt_alphas_cum_prod_t and sqrt_one_minus_alphas_cum_prod_t are scalars. Thus, we need to adjust the shape of these two scalars (#(1) and #(2)) so that the operation at line #(3) can be performed. The noisy_image variable is going to be the output of this function, whose name I guess is self-explanatory.

# Codeblock 16b
    def forward_diffusion(self, original, noise, t):
        sqrt_alphas_cum_prod_t = self.sqrt_alphas_cum_prod[t]
        sqrt_alphas_cum_prod_t = sqrt_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1)  #(1)
        
        sqrt_one_minus_alphas_cum_prod_t = self.sqrt_one_minus_alphas_cum_prod[t]
        sqrt_one_minus_alphas_cum_prod_t = sqrt_one_minus_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1)  #(2)
        
        noisy_image = sqrt_alphas_cum_prod_t * original + sqrt_one_minus_alphas_cum_prod_t * noise  #(3)
        
        return noisy_image

Now let’s talk about backward diffusion. In fact, this one is a bit more complicated than the forward diffusion since we need three more equations here. Before I give you these equations, let me show you the implementation first. See the Codeblock 16c below.

# Codeblock 16c
    def backward_diffusion(self, current_image, predicted_noise, t):  #(1)
        denoised_image = (current_image - (self.sqrt_one_minus_alphas_cum_prod[t] * predicted_noise)) / self.sqrt_alphas_cum_prod[t]  #(2)
        denoised_image = 2 * (denoised_image - denoised_image.min()) / (denoised_image.max() - denoised_image.min()) - 1  #(3)
        
        current_prediction = current_image - ((self.betas[t] * predicted_noise) / (self.sqrt_one_minus_alphas_cum_prod[t]))  #(4)
        current_prediction = current_prediction / torch.sqrt(self.alphas[t])  #(5)
        
        if t == 0:  #(6)
            return current_prediction, denoised_image
        
        else:
            variance = (1 - self.alphas_cum_prod[t-1]) / (1. - self.alphas_cum_prod[t])  #(7)
            variance = variance * self.betas[t]  #(8)
            sigma = variance ** 0.5
            z = torch.randn(current_image.shape).to(DEVICE)
            current_prediction = current_prediction + sigma*z
            
            return current_prediction, denoised_image

Later in the inference phase, the backward_diffusion() method will be called inside a loop that iterates NUM_TIMESTEPS (1000) times, starting from t = 999, continuing with t = 998, and so on all the way down to t = 0. This function is responsible for removing the noise from the image iteratively based on current_image (the image produced by the previous denoising step), predicted_noise (the noise predicted by the U-Net for the current step), and the timestep information t (#(1)). In each iteration, noise removal is done using the equation shown in Figure 12, which in Codeblock 16c corresponds to lines #(4–5).

Figure 12. The equation used for removing noise from the image [3].
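In standard DDPM notation, the deterministic part of this update (lines #(4–5) above) can be sketched as:

x_{t-1} \approx \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)

where ϵ_θ(x_t, t) is the noise predicted by the U-Net.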

As long as we haven't reached t = 0, we compute the variance based on the equation in Figure 13 (#(7–8)). This variance is then used to introduce another controlled noise to simulate the stochasticity of the backward diffusion process, since the noise removal equation in Figure 12 is a deterministic approximation. This is essentially also the reason we don't calculate the variance once we reach t = 0 (#(6)), since we no longer need to add more noise as the image is already completely clear.

Figure 13. The equation used to calculate variance for introducing controlled noise [3].
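Again as a sketch in standard DDPM notation, the variance computed at lines #(7–8) and the corresponding noise injection are:

\sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t, \qquad x_{t-1} \leftarrow x_{t-1} + \sigma_t z, \quad z \sim \mathcal{N}(0, I)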

Unlike current_prediction, which aims to estimate the image at the previous timestep (xₜ₋₁), the objective of the denoised_image tensor is to reconstruct the original image (x₀). Because of these different objectives, we need a separate equation to compute denoised_image, which can be seen in Figure 14 below. The implementation of this equation is written at lines #(2–3).

Figure 14. The equation for reconstructing the original image [3].
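The corresponding reconstruction, as implemented at line #(2) (with line #(3) then rescaling the result back to the -1 to 1 range), can be sketched as:

\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}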

Now let’s test the NoiseScheduler class we created above. In the following codeblock, I instantiate a NoiseScheduler object and print out the attributes associated with it, which are all computed using the equation in Figure 10 based on the values stored in the betas attribute. Remember that the actual length of these tensors is NUM_TIMESTEPS (1000), but here I only print out the first 6 elements.

# Codeblock 17
noise_scheduler = NoiseScheduler()

print(f'betas\t\t\t\t: {noise_scheduler.betas[:6]}')
print(f'alphas\t\t\t\t: {noise_scheduler.alphas[:6]}')
print(f'alphas_cum_prod\t\t\t: {noise_scheduler.alphas_cum_prod[:6]}')
print(f'sqrt_alphas_cum_prod\t\t: {noise_scheduler.sqrt_alphas_cum_prod[:6]}')
print(f'sqrt_one_minus_alphas_cum_prod\t: {noise_scheduler.sqrt_one_minus_alphas_cum_prod[:6]}')

# Codeblock 17 Output
betas                          : tensor([1.0000e-04, 1.1992e-04, 1.3984e-04, 1.5976e-04, 1.7968e-04, 1.9960e-04])
alphas                         : tensor([0.9999, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998])
alphas_cum_prod                : tensor([0.9999, 0.9998, 0.9996, 0.9995, 0.9993, 0.9991])
sqrt_alphas_cum_prod           : tensor([0.9999, 0.9999, 0.9998, 0.9997, 0.9997, 0.9996])
sqrt_one_minus_alphas_cum_prod : tensor([0.0100, 0.0148, 0.0190, 0.0228, 0.0264, 0.0300])

The above output indicates that our __init__() method works as expected. Next, we are going to test the forward_diffusion() method. If you go back to Codeblock 16b, you will see that forward_diffusion() accepts three inputs: the original image, the noise image, and the timestep number. Let's just use the image from the MNIST dataset we loaded earlier for the first input (#(1)) and a random Gaussian noise of the exact same size for the second one (#(2)). Run Codeblock 18 below to see what these two images look like.

# Codeblock 18
image = images[0]  #(1)
noise = torch.randn_like(image)  #(2)

plt.imshow(image.squeeze(), cmap='gray')
plt.show()
plt.imshow(noise.squeeze(), cmap='gray')
plt.show()
Figure 15. The two images to be used as the original (left) and the noise image (right). The one on the left is the same image I showed earlier in Figure 9 [3].

As we already have the image and the noise ready, what we need to do afterwards is pass them to the forward_diffusion() method alongside the t. I actually tried to run Codeblock 19 below multiple times with t = 50, 100, 150, and so on up to t = 300. You can see in Figure 16 that the image becomes less clear as the parameter increases. In this case, the image is going to be completely filled with noise when t is set to 999.

# Codeblock 19
noisy_image_test = noise_scheduler.forward_diffusion(image.to(DEVICE), noise.to(DEVICE), t=50)

plt.imshow(noisy_image_test[0].squeeze().cpu(), cmap='gray')
plt.show()
Figure 16. The result of the forward diffusion process at t=50, 100, 150, and so on until t=300 [3].
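
If you'd rather not rerun the cell by hand for each t, a small optional loop like the one below (not part of the original codeblocks) produces the same sequence of increasingly noisy panels shown in Figure 16:

# Optional: visualize forward diffusion at several timesteps in a single figure
fig, axes = plt.subplots(ncols=6, figsize=(18, 3))

for ax, t in zip(axes, [50, 100, 150, 200, 250, 300]):
    noisy = noise_scheduler.forward_diffusion(image.to(DEVICE), noise.to(DEVICE), t=t)
    ax.imshow(noisy[0].squeeze().cpu(), cmap='gray')  # same indexing as Codeblock 19
    ax.set_title(f't = {t}')
    ax.axis('off')

plt.show()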

Unfortunately, we cannot test the backward_diffusion() method since this process requires us to have our U-Net model trained. So, let’s just skip this part for now. I’ll show you how we can actually use this function later in the inference phase.


Training

Now that the U-Net model, the MNIST dataset, and the noise scheduler are ready, we can prepare a function for training. Before we do that, I instantiate the model and the noise scheduler in Codeblock 20 below.

# Codeblock 20
model = UNet().to(DEVICE)
noise_scheduler = NoiseScheduler()

The entire training procedure is implemented in the train() function shown in Codeblock 21. Before doing anything else, we initialize the optimizer and the loss function, which in this case are Adam and MSE, respectively (#(1–2)). What we basically want to do here is train the model such that it will be able to predict the noise contained in the input image; later on, the predicted noise will be used as the basis of the denoising process in the backward diffusion stage. To actually train the model, we first need to perform forward diffusion using the code at line #(6). This noising process is done on the images tensor (#(3)) using the random noise generated at line #(4). Next, we take a random timestep somewhere between 0 and NUM_TIMESTEPS (1000) for each image in the batch (#(5)), which is essentially done because we want our model to see images of varying noise levels as a way to improve generalization. Once the noisy images have been generated, we pass them through the U-Net model alongside the chosen t (#(7)). The input t is useful for the model since it indicates the current noise level in the image. Lastly, the loss function we initialized earlier is responsible for computing the difference between the actual noise and the noise predicted by the model (#(8)). So, the objective of this training is basically to make the predicted noise as similar as possible to the noise we generated at line #(4).

# Codeblock 21
def train():
    optimizer = Adam(model.parameters(), lr=LEARNING_RATE)  #(1)
    loss_function = nn.MSELoss()  #(2)
    losses = []
    
    for epoch in range(NUM_EPOCHS):
        print(f'Epoch no {epoch}')
        
        for images, _ in tqdm(loader):
            
            optimizer.zero_grad()

            images = images.float().to(DEVICE)  #(3)
            noise = torch.randn_like(images)  #(4)
            t = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,))  #(5)

            noisy_images = noise_scheduler.forward_diffusion(images, noise, t).to(DEVICE)  #(6)
            predicted_noise = model(noisy_images, t)  #(7)
            loss = loss_function(predicted_noise, noise)  #(8)
            
            losses.append(loss.item())
            loss.backward()
            optimizer.step()

    return losses

Now let’s run the above training function using the codeblock below. Sit back and relax while waiting the training completes. In my case, I used Kaggle Notebook with Nvidia GPU P100 turned on, and it took around 45 minutes to finish.

# Codeblock 22
losses = train()
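
Since training takes a while, it's worth saving the weights once it finishes. This is not part of the original notebook, just a minimal sketch using PyTorch's standard serialization (the filename is arbitrary):

# Optional: save and later restore the trained U-Net weights
torch.save(model.state_dict(), 'diffusion_unet.pth')

# model = UNet().to(DEVICE)
# model.load_state_dict(torch.load('diffusion_unet.pth', map_location=DEVICE))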

If we take a look at the loss graph, it seems like our model learned pretty well, as the value generally decreases over time, with a rapid drop in the early stages and a more stable (yet still decreasing) trend later on. So, I think we can expect good results in the inference phase.

# Codeblock 23
plt.plot(losses)
Figure 17. How the loss value decreases as the training goes [3].
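
Because the loss values are recorded per batch, the raw curve in Figure 17 can look noisy. A simple moving average (an optional sketch, not in the original code) makes the trend easier to read:

# Optional: smooth the per-batch loss with a moving average before plotting
import numpy as np

window = 100
smoothed = np.convolve(losses, np.ones(window) / window, mode='valid')

plt.plot(smoothed)
plt.xlabel('batch')
plt.ylabel('MSE loss (moving average)')
plt.show()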

Inference

At this point we have our model trained, so we can now perform inference with it. Look at Codeblock 24 below to see how I implement the inference() function.

# Codeblock 24
def inference():

    denoised_images = []  #(1)
    
    with torch.no_grad():  #(2)
        current_prediction = torch.randn((64, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)  #(3)
        
        for i in tqdm(reversed(range(NUM_TIMESTEPS))):  #(4)
            predicted_noise = model(current_prediction, torch.as_tensor(i).unsqueeze(0))  #(5)
            current_prediction, denoised_image = noise_scheduler.backward_diffusion(current_prediction, predicted_noise, torch.as_tensor(i))  #(6)

            if i%100 == 0:  #(7)
                denoised_images.append(denoised_image)
            
        return denoised_images

At the line marked with #(1) I initialize an empty list which will be used to store the denoising result every 100 timesteps (#(7)). This will later allow us to see how the backward diffusion goes. The actual inference process is encapsulated inside torch.no_grad() (#(2)). Remember that in diffusion models we generate images from completely random noise, which we assume to be the state of the images at t = 999. To implement this, we can simply use torch.randn() as shown at line #(3). Here we initialize a tensor of size 64×1×28×28, indicating that we are about to generate 64 images simultaneously. Next, we write a for loop that iterates backwards from 999 to 0 (#(4)). Inside this loop, we feed the current image and the timestep as the input to the trained U-Net and let it predict the noise (#(5)). The actual backward diffusion is then performed at line #(6). At the end of the iterations, we should get new images similar to the ones we have in our dataset. Now let's call the inference() function in the following codeblock.

# Codeblock 25
denoised_images = inference()
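
If you want to keep the generated samples, torchvision's save_image utility (an optional extra, assuming torchvision is installed, which it typically is when working with MNIST) can write the final grid to disk:

# Optional: save the final (t = 0) samples as a single image grid
from torchvision.utils import save_image

final_samples = denoised_images[-1]  # expected shape: (64, 1, 28, 28)
save_image(final_samples, 'generated_digits.png', nrow=8, normalize=True)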

Once the inference is complete, we can see what the resulting images look like. Codeblock 26 below is used to display the first 42 images we just generated.

# Codeblock 26
fig, axes = plt.subplots(ncols=7, nrows=6, figsize=(10, 8))

counter = 0

for i in range(6):
    for j in range(7):
        axes[i,j].imshow(denoised_images[-1][counter].squeeze().detach().cpu().numpy(), cmap='gray')  #(1)
        axes[i,j].get_xaxis().set_visible(False)
        axes[i,j].get_yaxis().set_visible(False)
        counter += 1

plt.show()
Figure 18. The images generated by the diffusion model trained on the MNIST Handwritten Digit dataset [3].

If we take a look at the above codeblock, you can see that the [-1] indexer at line #(1) indicates that we only display the images from the last iteration (which corresponds to timestep 0). This is the reason the images you see in Figure 18 are all free from noise. I do acknowledge that this might not be the best possible result, since not all of the generated images are valid digits. On the upside, this also indicates that the images are not merely duplicates of the ones in the original dataset.

Here we can also visualize the backward diffusion process using the Codeblock 27 below. You can see in the resulting output in Figure 19 that we initially start from a complete random noise, which gradually disappears as we move to the right.

# Codeblock 27
fig, axes = plt.subplots(ncols=10, figsize=(24, 8))

sample_no = 0
timestep_no = 0

for i in range(10):
    axes[i].imshow(denoised_images[timestep_no][sample_no].squeeze().detach().cpu().numpy(), cmap='gray')
    axes[i].get_xaxis().set_visible(False)
    axes[i].get_yaxis().set_visible(False)
    timestep_no += 1

plt.show()
Figure 19. What the image looks like at timestep 900, 800, 700 and so on until timestep 0 [3].

Ending

There are plenty of directions you can go from here. First, you might need to tweak the parameter configurations in Codeblock 2 if you want better results. Second, it is also possible to modify the U-Net model by implementing attention layers in addition to the stack of convolution layers we used in the downsampling and upsampling stages. This does not guarantee better results, especially for a simple dataset like this, but it's definitely worth trying. Third, you can also try a more complex dataset if you want to challenge yourself.

When it comes to practical applications, there are actually lots of things you can do with diffusion models. The simplest one might be data augmentation. With a diffusion model, we can easily generate new images from a specific data distribution. For example, suppose we are working on an image classification project, but the number of images across classes is imbalanced. To address this problem, we can take the images from the minority class and train a diffusion model on them. The trained diffusion model can then generate as many samples from that class as we want.

And well, that’s pretty much everything about the theory and the implementation of diffusion model. Thanks for reading, I hope you learn something new today!

You can access the code used in this project through this link. Here are also the links to my previous articles about Autoencoder, Variational Autoencoder (VAE), Neural Style Transfer (NST), and Transformer.


References

[1] Jascha Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Arxiv. https://arxiv.org/pdf/1503.03585 [Accessed December 27, 2024].

[2] Jonathan Ho et al. Denoising Diffusion Probabilistic Models. Arxiv. https://arxiv.org/pdf/2006.11239 [Accessed December 27, 2024].

[3] Image created originally by author.

[4] Olaf Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. Arxiv. https://arxiv.org/pdf/1505.04597 [Accessed December 27, 2024].

[5] Yann LeCun et al. The MNIST Database of Handwritten Digits. https://yann.lecun.com/exdb/mnist/ [Accessed December 30, 2024] (Creative Commons Attribution-Share Alike 3.0 license).

[6] Ashish Vaswani et al. Attention Is All You Need. Arxiv. https://arxiv.org/pdf/1706.03762 [Accessed September 29, 2024].

The post The Art of Noise appeared first on Towards Data Science.

The Art of Hybrid Architectures https://towardsdatascience.com/the-art-of-hybrid-architectures/ Sat, 29 Mar 2025 03:38:17 +0000 https://towardsdatascience.com/?p=605337 Combining CNNs and Transformers to Elevate Fine-Grained Visual Classification

The post The Art of Hybrid Architectures appeared first on Towards Data Science.


In my previous article, I discussed how morphological feature extractors mimic the way biological experts visually assess images.

This time, I want to go a step further and explore a new question:
Can different architectures complement each other to build an AI that “sees” like an expert?

Introduction: Rethinking Model Architecture Design

While building a high-accuracy visual recognition model, I ran into a key challenge:

How do we get AI to not just “see” an image, but actually understand the features that matter?

Traditional CNNs excel at capturing local details like fur texture or ear shape, but they often miss the bigger picture. Transformers, on the other hand, are great at modeling global relationships (how different regions of an image interact), but they can easily overlook fine-grained cues.

This insight led me to explore combining the strengths of both architectures to create a model that not only captures fine details but also comprehends the bigger picture.

While developing PawMatchAI, a 124-breed dog classification system, I went through three major architectural phases:

1. Early Stage: EfficientNetV2-M + Multi-Head Attention

I started with EfficientNetV2-M and added a multi-head attention module.

I experimented with 4, 8, and 16 heads—eventually settling on 8, which gave the best results.

This setup reached an F1 score of 78%, but it felt more like a technical combination than a cohesive design.

2. Refinement: Focal Loss + Advanced Data Augmentation

After closely analyzing the dataset, I noticed a class imbalance: some breeds appeared far more frequently than others, skewing the model's predictions.

To address this, I introduced Focal Loss, along with RandAug and mixup, to make the data distribution more balanced and diverse.
This pushed the F1 score up to 82.3%.
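
For readers who haven't used it before, a common multi-class formulation of focal loss is sketched below. The article doesn't show the exact variant used in PawMatchAI, so the gamma value and the optional per-class alpha weights are illustrative defaults.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Down-weights easy examples so training focuses on hard, rare classes."""
    def __init__(self, gamma=2.0, alpha=None):
        super().__init__()
        self.gamma = gamma  # focusing parameter
        self.alpha = alpha  # optional per-class weights, tensor of shape [num_classes]

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, weight=self.alpha, reduction='none')
        pt = torch.exp(-ce)  # probability assigned to the true class
        return ((1 - pt) ** self.gamma * ce).mean()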

3. Breakthrough: Switching to ConvNextV2-Base + Training Optimization

Next, I replaced the backbone with ConvNextV2-Base, and optimized the training using OneCycleLR and a progressive unfreezing strategy.
The F1 score climbed to 87.89%.

But during real-world testing, the model still struggled with visually similar breeds, indicating room for improvement in generalization.
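
To make the training-optimization step concrete, below is a rough sketch of how OneCycleLR and progressive unfreezing are typically wired together in PyTorch. The names model, train_loader, num_epochs, and unfreeze_epoch, as well as the learning rates and weight decay, are illustrative assumptions rather than the exact PawMatchAI settings.

import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

# Freeze the backbone at first so only the newly added layers learn
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = OneCycleLR(optimizer, max_lr=1e-3,
                       steps_per_epoch=len(train_loader), epochs=num_epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    if epoch == unfreeze_epoch:  # progressive unfreezing
        for p in model.backbone.parameters():
            p.requires_grad = True  # frozen params had no gradients, so AdamW simply skipped them

    for images, labels in train_loader:  # device transfers omitted for brevity
        optimizer.zero_grad()
        logits = model(images)           # simplified forward signature
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()                 # OneCycleLR is stepped once per batch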

4. Final Step: Building a Truly Hybrid Architecture

After reviewing the first three phases, I realized the core issue: stacking technologies isn’t the same as getting them to work together.

What I needed was true collaboration between the CNN, the Transformer, and the morphological feature extractor, each playing to its strengths. So I restructured the entire pipeline.

ConvNextV2 was in charge of extracting detailed local features.
The morphological module acted like a domain expert, highlighting features critical for breed identification.

Finally, the multi-head attention brought it all together by modeling global relationships.

This time, they weren’t just independent modules, they were a team.
CNNs identified the details, the morphology module amplified the meaningful ones, and the attention mechanism tied everything into a coherent global view.

Key Result: The F1 score rose to 88.70%, but more importantly, this gain came from the model learning to understand morphology, not just memorize textures or colors.

It started recognizing subtle structural features—just like a real expert would—making better generalizations across visually similar breeds.

💡 If you’re interested, I’ve written more about morphological feature extractors here.

These extractors mimic how biological experts assess shape and structure, enhancing critical visual cues like ear shape and body proportions.

They’re a vital part of this hybrid design, filling the gaps traditional models tend to overlook.

In this article, I’ll walk through:

  • The strengths and limitations of CNNs vs. Transformers—and how they can complement each other
  • Why I ultimately chose ConvNextV2 over EfficientNetV2
  • The technical details of multi-head attention and how I decided the number of heads
  • How all these elements came together in a unified hybrid architecture
  • And finally, how heatmaps reveal that the AI is learning to “see” key features, just like a human expert

1. The Strengths and Limitations of CNNs and Transformers

In the previous section, I discussed how CNNs and Transformers can effectively complement each other. Now, let’s take a closer look at what sets each architecture apart, their individual strengths, limitations, and how their differences make them work so well together.

1.1 The Strength of CNNs: Great with Details, Limited in Scope

CNNs are like meticulous artists: they can draw fine lines beautifully, but often miss the bigger composition.

✅ Strong at Local Feature Extraction
CNNs are excellent at capturing edges, textures, and shapes—ideal for distinguishing fine-grained features like ear shapes, nose proportions, and fur patterns across dog breeds.

✅ Computational Efficiency
With parameter sharing, CNNs process high-resolution images more efficiently, making them well-suited for large-scale visual tasks.

✅ Translation Invariance
Even when a dog’s pose varies, CNNs can still reliably identify its breed.

That said, CNNs have two key limitations:

⚠ Limited Receptive Field:
CNNs expand their field of view layer by layer, but early-stage neurons only “see” small patches of pixels. As a result, it’s difficult for them to connect features that are spatially far apart.

🔹 For instance: When identifying a German Shepherd, the CNN might spot upright ears and a sloped back separately, but struggle to associate them as defining characteristics of the breed.

⚠ Lack of Global Feature Integration:
CNNs excel at local stacking of features, but they’re less adept at combining information from distant regions.

🔹 Example: To distinguish a Siberian Husky from an Alaskan Malamute, it’s not just about one feature, it’s about the combination of ear shape, facial proportions, tail posture, and body size. CNNs often struggle to consider these elements holistically.

1.2 The Strength of Transformers: Global Awareness, But Less Precise

Transformers are like master strategists with a bird's-eye view: they quickly spot patterns, but aren't great at filling in the fine details.

✅ Capturing Global Context
Thanks to their self-attention mechanism, Transformers can directly link any two features in an image, no matter how far apart they are.

✅ Dynamic Attention Weighting
Unlike CNNs’ fixed kernels, Transformers dynamically allocate focus based on context.

🔹 Example: When identifying a Poodle, the model may prioritize fur texture; when it sees a Bulldog, it might focus more on facial structure.

But Transformers also have two major drawbacks:

⚠ High Computational Cost:
Self-attention has a time complexity of O(n²). As image resolution increases, so does the cost—making training more intensive.

⚠ Weak at Capturing Fine Details:
Transformers lack CNNs’ “built-in intuition” that nearby pixels are usually related.

🔹 Example: On their own, Transformers might miss subtle differences in fur texture or eye shape, details that are crucial for distinguishing visually similar breeds.

1.3 Why a Hybrid Architecture Is Necessary

Let’s take a real world case:

How do you distinguish a Golden Retriever from a Labrador Retriever?

They’re both beloved family dogs with similar size and temperament. But experts can easily tell them apart by observing:

  • Golden Retrievers have long, dense coats ranging from golden to dark gold, more elongated heads, and distinct feathering around ears, legs, and tails.
  • Labradors, on the other hand, have short, double-layered coats, more compact bodies, rounder heads, and thick otter-like tails. Their coats come in yellow, chocolate, or black.

Interestingly, for humans this distinction is relatively easy; "long hair vs. short hair" might be all you need.

But for AI, relying solely on coat length (a texture-based feature) is often unreliable. Lighting, image quality, or even a trimmed Golden Retriever can confuse the model.

When analyzing this challenge, we can see…

The problem with using only CNNs:

  • While CNNs can detect individual features like “coat length” or “tail shape,” they struggle with combinations like “head shape + fur type + body structure.” This issue worsens when the dog is in a different pose.

The problem with using only Transformers:

  • Transformers can associate features across the image, but they’re not great at picking up fine-grained cues like slight variations in fur texture or subtle head contours. They also require large datasets to achieve expert-level performance.
  • Plus, their computational cost increases sharply with image resolution, slowing down training.

These limitations highlight a core truth:

Fine-grained visual recognition requires both local detail extraction and global relationship modeling.

A truly expert system like a veterinarian or show judge must inspect features up close while understanding the overall structure. That’s exactly where hybrid architectures shine.

1.4 The Advantages of a Hybrid Architecture

This is why we need hybrid architectures, systems that combine CNNs' precision in local features with Transformers' ability to model global relationships:

  • CNNs: Extract local, fine-grained features like fur texture and ear shape, crucial for spotting subtle differences.
  • Transformers: Capture long-range dependencies (e.g., head shape + body size + eye color), allowing the model to reason holistically.
  • Morphological Feature Extractors: Mimic human expert judgment by emphasizing diagnostic features, bridging the gap left by data-driven models.

Such an architecture not only boosts evaluation metrics like the F1 Score, but more importantly, it enables the AI to genuinely understand the subtle distinctions between breeds, getting closer to the way human experts think. The model learns to weigh multiple features together, instead of over-relying on one or two unstable cues.

In the next section, I’ll dive into how I actually built this hybrid architecture, especially how I selected and integrated the right components.

2. Why I Chose ConvNextV2: Key Innovations Behind the Backbone

Among the many visual recognition architectures available, why did I choose ConvNextV2 as the backbone of my project?

Because its design effectively combines the best of both worlds: the CNN’s ability to extract precise local features, and the Transformer’s strength in capturing long-range dependencies.

Let’s break down three core innovations that made it the right fit.

2.1 FCMAE Self-Supervised Learning: Adaptive Learning Inspired by the Human Brain

Imagine learning to navigate with your eyes covered: your brain becomes laser-focused on memorizing the details you can perceive.

ConvNextV2 uses a self-supervised pretraining strategy similar to that of Vision Transformers.

During training, up to 60% of input pixels are intentionally masked, and the model must learn to reconstruct the missing regions.
This “make learning harder on purpose” approach actually leads to three major benefits:

  • Comprehensive Feature Learning
    The model learns the underlying structure and patterns of an image—not just the most obvious visual cues.
    In the context of breed classification, this means it pays attention to fur texture, skeletal structure, and body proportions, instead of relying solely on color or shape.
  • Reduced Dependence on Labeled Data
    By pretraining on unlabeled dog images, the model develops strong visual representations.
    Later, with just a small amount of labeled data, it can fine-tune effectively—saving significant annotation effort.
  • Improved Recognition of Rare Patterns
    The reconstruction task pushes the model to learn generalized visual rules, enhancing its ability to identify rare or underrepresented breeds.

2.2 GRN Global Calibration: Mimicking an Expert’s Attention

Like a seasoned photographer who adjusts the exposure of each element to highlight what truly matters.

GRN (Global Response Normalization) is arguably the most impactful innovation in ConvNextV2, giving CNNs a degree of global awareness that was previously lacking:

  • Dynamic Feature Recalibration
    GRN globally normalizes the feature map, amplifying the most discriminative signals while suppressing irrelevant ones.
    For instance, when identifying a German Shepherd, it emphasizes upright ears and the sloped back while minimizing background noise.
  • Enhanced Sensitivity to Subtle Differences
    This normalization sharpens feature contrast, making it easier to spot fine-grained differences—critical for telling apart breeds like the Siberian Husky and Alaskan Malamute.
  • Focus on Diagnostic Features
    GRN helps the model prioritize features that truly matter for classification, rather than relying on statistically correlated but causally irrelevant cues.
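
For readers curious about what GRN actually computes, the sketch below follows the reference formulation from the ConvNeXt V2 paper: aggregate a global L2 statistic per channel, normalize it across channels, then use it to recalibrate the features. It is an illustration, not the exact library code used by timm.

import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization, following the ConvNeXt V2 paper."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):  # x: (N, H, W, C), channels-last
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # global aggregation per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)      # divisive normalization across channels
        return self.gamma * (x * nx) + self.beta + x          # recalibration plus residual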

2.3 Sparse and Efficient Convolutions: More with Less

Like a streamlined team where each member plays to their strengths, reducing redundancy while boosting performance.

ConvNextV2 incorporates architectural optimizations such as depthwise separable convolutions and sparse connections, resulting in three major gains:

  • Improved Computational Efficiency
    By breaking down convolutions into smaller, more efficient steps, the model reduces its computational load.
    This allows it to process high-resolution dog images and detect fine visual differences without requiring excessive resources.
  • Expanded Effective Receptive Field
    The layout of convolutions is designed to extend the model’s field of view, helping it analyze both overall body structure and local details simultaneously.
  • Parameter Efficiency
    The architecture ensures that each parameter carries more learning capacity, extracting richer, more nuanced information using the same amount of compute.
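
To illustrate the depthwise separable pattern mentioned above, here is a generic PyTorch sketch (not ConvNextV2's actual block): a per-channel spatial convolution followed by a 1×1 pointwise convolution that mixes information across channels.

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=7, padding=3):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        # Pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))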

2.4 Why ConvNextV2 Was the Right Fit for a Hybrid Architecture

ConvNextV2 turned out to be the perfect backbone for this hybrid system, not just because of its performance, but because it embodies the very philosophy of fusion.

It retains the local precision of CNNs while adopting key design concepts from Transformers to expand its global awareness. This duality makes it a natural bridge between CNNs and Transformers, capable of preserving fine-grained details while understanding the broader context.

It also lays the groundwork for additional modules like multi-head attention and morphological feature extractors, ensuring the model starts with a complete, balanced feature set.

In short, ConvNextV2 doesn’t just “see the parts”, it starts to understand how the parts come together. And in a task like dog breed classification, where both minute differences and overall structure matter, this kind of foundation is what transforms an ordinary model into one that can reason like an expert.

3. Technical Implementation of the MultiHeadAttention Mechanism

In neural networks, the core concept of the attention mechanism is to enable models to “focus” on key parts of the input, similar to how human experts consciously focus on specific features (such as ear shape, muzzle length, tail posture) when identifying dog breeds.
The Multi-Head Attention (MHA) mechanism further enhances this ability:

“Rather than having one expert evaluate all features, it’s better to form a panel of experts, letting each focus on different details, and then synthesize a final judgment!”

Mathematically, MHA uses multiple linear projections to allow the model to simultaneously learn different feature associations, further enhancing performance.

3.1 Understanding MultiHeadAttention from a Mathematical Perspective

The core idea of MultiHeadAttention is to use multiple different projections to allow the model to simultaneously attend to patterns in different subspaces. Mathematically, it first projects input features into three roles: Query, Key, and Value, then calculates the similarity between Query (Q) and Key (K), and uses this similarity to perform weighted averaging of Values.

The basic formula can be expressed as:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

3.2 Application of Einstein Summation Convention in Attention Calculation

In the implementation, I used the torch.einsum function based on the Einstein summation convention to efficiently calculate attention scores:

energy = torch.einsum("nqd,nkd->nqk", [q, k])

This means:

  • q has shape (batch_size, num_heads, head_dim)
  • k has shape (batch_size, num_heads, head_dim)
  • The dot product is performed over dimension d, so the result has shape (batch_size, num_heads, num_heads), which acts here as the attention weight matrix

This is essentially "calculating the similarity between each Query and all Keys."
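
A quick shape check makes this concrete (2 samples, 8 heads, 64 dimensions per head):

import torch

q = torch.randn(2, 8, 64)  # (batch_size, num_heads, head_dim)
k = torch.randn(2, 8, 64)

energy = torch.einsum("nqd,nkd->nqk", [q, k])
print(energy.shape)  # torch.Size([2, 8, 8]) -> one attention weight per head pair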

3.3 Implementation Code Analysis

Key implementation code for MultiHeadAttention:

def forward(self, x):

    N = x.shape[0]  # batch size

    # 1. Project input, prepare for multi-head attention calculation
    x = self.fc_in(x)  # (N, input_dim) → (N, scaled_dim)

    # 2. Calculate Query, Key, Value, and reshape into multi-head form
    q = self.query(x).view(N, self.num_heads, self.head_dim)  # query
    k = self.key(x).view(N, self.num_heads, self.head_dim)    # key
    v = self.value(x).view(N, self.num_heads, self.head_dim)  # value

    # 3. Calculate attention scores (similarity matrix)
    energy = torch.einsum("nqd,nkd->nqk", [q, k])

    # 4. Apply softmax (normalize weights) and perform scaling
    attention = F.softmax(energy / (self.head_dim ** 0.5), dim=2)

    # 5. Use attention weights to perform weighted sum on Value
    out = torch.einsum("nqk,nkd->nqd", [attention, v])  # note: k indexes both operands so the sum runs over the keys

    # 6. Rearrange output and pass through final linear layer
    out = out.reshape(N, self.scaled_dim)
    out = self.fc_out(out)

    return out

3.3.1. Steps 1-2: Projection and Multi-Head Splitting
First, input features are projected through a linear layer, and then separately projected into query, key, and value spaces. Importantly, these projections not only change the feature representation but also split them into multiple “heads,” each attending to different feature subspaces.

3.3.2. Steps 3-4: Attention Calculation
The einsum described above computes the similarity (energy) between every query and key. The result is then scaled by the square root of the head dimension and passed through a softmax so that the attention weights for each query sum to one.

3.3.3. Steps 5-6: Weighted Aggregation and Output Projection
Using the calculated attention weights, weighted summation is performed on the value vectors to obtain the attended feature representation. Finally, outputs from all heads are concatenated and passed through an output projection layer to get the final result.

This implementation has the following simplifications and adjustments compared to standard Transformer MultiHeadAttention:

  • Query, key, and value come from the same input (self-attention), suitable for processing features obtained from CNN backbone networks.
  • It uses einsum operations to simplify matrix calculations.
  • The design of the projection layers ensures dimensional consistency, facilitating integration with other modules.

3.4 How Attention Mechanisms Enhance Understanding of Morphological Feature Relationships

The multi-head attention mechanism brings three core advantages to dog breed recognition:

3.4.1. Feature Relationship Modeling

Just as a professional veterinarian not only sees that ears are upright but also notices how this combines with tail curl degree and skull shape to form a dog breed’s “feature combination.”

It can establish associations between different morphological features, capturing their synergistic relationships, not just seeing “what features exist” but observing “how these features combine.”

Application: The model can learn that a combination of “pointed ears + curled tail + medium build” points to specific Northern dog breeds.

3.4.2. Dynamic Feature Importance Assessment

Just as experts know to focus particularly on fur texture when identifying Poodles, while focusing mainly on the distinctive nose and head structure when identifying Bulldogs.

It dynamically adjusts focus on different features based on the specific content of the input.

Key features vary across different breeds, and the attention mechanism can adaptively focus.

Application: When seeing a Border Collie, the model might focus more on fur color distribution; when seeing a Dachshund, it might focus more on body proportions.

3.4.3. Complementary Information Integration

Like a team of experts with different specializations, one focusing on skeletal structure, another on fur features, another analyzing behavioral posture, making a more comprehensive judgment together.

Through multiple attention heads, each simultaneously captures different types of feature relationships. Each head can focus on a specific type of feature or relationship pattern.

Application: One head might primarily focus on color patterns, another on body proportions, and yet another on facial features, ultimately synthesizing these perspectives to make a judgment.

By combining these three capabilities, the MultiHeadAttention mechanism goes beyond identifying individual features, it learns to model the complex relationships between them, capturing subtle patterns that emerge from their combinations and enabling more accurate recognition.

4. Implementation Details of the Hybrid Architecture

4.1 The Overall Architectural Flow

When designing this hybrid architecture, my goal was simple yet ambitious:

Let each component do what it does best, and build a complementary system where they enhance one another.

Much like a well-orchestrated symphony, each instrument (or module) plays its role; only together can they create harmony.
In this setup:

  • The CNN focuses on capturing local details.
  • The morphological feature extractor enhances key structural features.
  • The multi-head attention module learns how these features interact.



As shown in the diagram above, the overall model operates through five key stages:

4.1.1. Feature Extraction

Once an image enters the model, ConvNextV2 takes charge of extracting foundational features, such as fur color, contours, and texture. This is where the AI begins to “see” the basic shape and appearance of the dog.

4.1.2. Morphological Feature Enhancement

These initial features are then refined by the morphological feature extractor. This module functions like an expert’s eye—highlighting structural characteristics such as ear shape and body proportions. Here, the AI learns to focus on what actually matters.

4.1.3. Feature Fusion

Next comes the feature fusion layer, which merges the local features with the enhanced morphological cues. But this isn't just a simple concatenation: the layer also models how these features interact, ensuring the AI doesn't treat them in isolation, but rather understands how they combine to convey meaning.

4.1.4. Feature Relationship Modeling

The fused features are passed into the multi-head attention module, which builds contextual relationships between different attributes. The model begins to understand combinations like “ear shape + fur texture + facial proportions” rather than looking at each trait independently.

4.1.5. Final Classification

After all these layers of processing, the model moves to its final classifier, where it makes a prediction about the dog’s breed, based on the rich, integrated understanding it has developed.

4.2 Integrating ConvNextV2 and Parameter Setup

For implementation, I chose the pretrained ConvNextV2-base model as the backbone:

self.backbone = timm.create_model(
    'convnextv2_base',
    pretrained=True,
    num_classes=0)  # Use only the feature extractor; remove original classification head

Depending on the input image size or backbone architecture, the feature output dimensions may vary. To build a robust and flexible system, I designed a dynamic feature dimension detection mechanism:

with torch.no_grad():
    dummy_input = torch.randn(1, 3, 224, 224)
    features = self.backbone(dummy_input)
    if len(features.shape) > 2:
        features = features.mean([-2, -1])  # Global average pooling to produce a 1D feature vector
    self.feature_dim = features.shape[1]

This ensures the system automatically adapts to any feature shape changes, keeping all downstream components functioning properly.

4.3 Intelligent Configuration of the Multi-Head Attention Layer

As mentioned earlier, I experimented with several head counts. Too many heads increased computation and risked overfitting. I ultimately settled on eight, but allowed the number of heads to adjust automatically based on feature dimensions:

self.num_heads = max(1, min(8, self.feature_dim // 64))
self.attention = MultiHeadAttention(self.feature_dim, num_heads=self.num_heads)
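
As a quick worked example (assuming, for illustration, that the pooled ConvNextV2-base features are 1024-dimensional):

feature_dim = 1024                             # illustrative value
num_heads = max(1, min(8, feature_dim // 64))  # min(8, 16) -> 8 heads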

4.4 Making CNN, Transformers, and Morphological Features Work Together

The morphological feature extractor works hand-in-hand with the attention mechanism.

While the former provides structured representations of key traits, the latter models relationships among these features:

# Feature fusion
combined_features = torch.cat([
    features,  # Base features
    morphological_features,  # Morphological features
    features * morphological_features  # Interaction between features
], dim=1)
fused_features = self.feature_fusion(combined_features)

# Apply attention
attended_features = self.attention(fused_features)

# Final classification
logits = self.classifier(attended_features)

return logits, attended_features

A special note about the third component features * morphological_features — this isn’t just a mathematical multiplication. It creates a form of dialogue between the two feature sets, allowing them to influence each other and generate richer representations.

For example, suppose the model picks up “pointy ears” from the base features, while the morphological module detects a “small head-to-body ratio.”

Individually, these may not be conclusive, but their interaction may strongly suggest a specific breed, like a Corgi or a Finnish Spitz. It's no longer just about recognizing ears or head size; the model learns to interpret how features work together, much like an expert would.

This full pipeline, from feature extraction, through morphological enhancement and attention-driven modeling, to prediction, is my vision of what an ideal architecture should look like.

The design has several key advantages:

  • The morphological extractor brings structured, expert-inspired understanding.
  • The multi-head attention uncovers contextual relationships between traits.
  • The feature fusion layer captures nonlinear interactions through element-wise multiplication.

4.5 Technical Challenges and How I Solved Them

Building a hybrid architecture like this was far from smooth sailing.
Here are several challenges I faced and how solving them helped me improve the overall design:

4.5.1. Mismatched Feature Dimensions

  • Challenge: Output sizes varied across modules, especially when switching backbone networks.
  • Solution: In addition to the dynamic dimension detection mentioned earlier, I implemented adaptive projection layers to unify the feature dimensions.

4.5.2. Balancing Performance and Efficiency

  • Challenge: More complexity meant more computation.
  • Solution: I dynamically adjusted the number of attention heads, and used efficient einsum operations to optimize performance.

4.5.3. Overfitting Risk

  • Challenge: Hybrid models are more prone to overfitting, especially with smaller training sets.
  • Solution: I applied LayerNorm, Dropout, and weight decay for regularization, as sketched below.
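
A minimal sketch of what this combination can look like in practice; the feature dimension, dropout rate, and weight-decay value are illustrative assumptions, and 124 is the number of breeds mentioned earlier:

import torch.nn as nn
from torch.optim import AdamW

# Classifier head with LayerNorm + Dropout (illustrative sizes)
classifier = nn.Sequential(
    nn.LayerNorm(1024),
    nn.Dropout(p=0.3),
    nn.Linear(1024, 124),  # 124 dog breeds
)

# Weight decay applied through the optimizer
optimizer = AdamW(classifier.parameters(), lr=1e-4, weight_decay=1e-2)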

4.5.4. Gradient Flow Issues

  • Challenge: Deep architectures often suffer from vanishing or exploding gradients.
  • Solution: I introduced residual connections to ensure gradients flow smoothly during both forward and backward passes.

If you’re interested in exploring the full implementation, feel free to check out the GitHub project here.

5. Performance Evaluation and Heatmap Analysis

The value of a hybrid architecture lies not only in its quantitative performance but also in how it qualitatively “thinks.”

In this section, we’ll use confidence score statistics and heatmap analysis to demonstrate how the model evolved from CNN → CNN+Transformer → CNN+Transformer+MFE, and how each stage brought its visual reasoning closer to that of a human expert.

To ensure that the performance differences came purely from architecture design, I retrained each model using the exact same dataset, augmentation methods, loss function, and training parameters. The only variation was the presence or absence of the Transformer and morphological modules.

In terms of F1 score, the CNN-only model reached 87.83%, the CNN+Transformer variant performed slightly better at 89.48%, and the final hybrid model scored 88.70%. While the CNN+Transformer version showed the highest score on paper, that didn't always translate into more reliable predictions. In fact, the hybrid model was more consistent in practice and handled similar-looking or blurry cases more reliably.

5.1 Confidence Scores and Statistical Insights

I tested 17 images of Border Collies, including standard photos, artistic illustrations, and various camera angles, to thoroughly assess the three architectures.

While other breeds were also included in the broader evaluation, I chose Border Collie as a representative case due to its distinctive features and frequent confusion with similar breeds.

Figure 1: Model Confidence Score Comparison
As shown above, there are clear performance differences across the three models.

A notable example is Sample #3, where the CNN-only model misclassified the Border Collie as a Collie, with a low confidence score of 0.2492.

While the CNN+Transformer corrected this error, it introduced a new one in Sample #5, misidentifying it as a Shiba Inu with 0.2305 confidence.

The final CNN+Transformer+MFE model correctly identified all samples without error. What’s interesting here is that both misclassifications occurred at low confidence levels (below 0.25).
This suggests that even when the model makes a mistake, it retains a sense of uncertainty—a desirable trait in real world applications. We want models to be cautious when unsure, rather than confidently wrong.


Figure 2: Confidence Score Distribution
Looking at the distribution of confidence scores, the improvement becomes even more evident.

The CNN-only model mostly predicted in the 0.4–0.5 range, with few samples reaching beyond 0.6.

CNN+Transformer showed better concentration around 0.5–0.6, but still had only one sample in the 0.7–0.8 high-confidence range.
The CNN+Transformer+MFE model stood out with 6 samples reaching the 0.7–0.8 confidence level.

This rightward shift in distribution reveals more than just accuracy, it reflects certainty.

The model is evolving from “barely correct” to “confidently correct,” which significantly enhances its reliability in real-world deployment.

Figure 3: Statistical Summary of Model Performance
A deeper statistical breakdown highlights consistent improvements:

Mean confidence score rose from 0.4639 (CNN) to 0.5245 (CNN+Transformer), and finally 0.6122 with the full hybrid setup—a 31.9% increase overall.

Median score jumped from 0.4665 to 0.6827, confirming the overall shift toward higher confidence.

The proportion of high-confidence predictions (≥ 0.5) also showed striking gains:

  • CNN: 41.18%
  • CNN+Transformer: 64.71%
  • CNN+Transformer+MFE: 82.35%

This means that with the final architecture, most predictions are not only correct but confidently correct.

You might notice a slight increase in standard deviation (from 0.1237 to 0.1616), which might seem like a negative at first. But in reality, this reflects a more nuanced response to input complexity:

The model is highly confident on easier samples, and appropriately cautious on harder ones. The improvement in maximum confidence value (from 0.6343 to 0.7746) further shows how this hybrid architecture can make more decisive and assured judgments when presented with straightforward samples.

5.2 Heatmap Analysis: Tracing the Evolution of Model Reasoning

While statistical metrics are helpful, they don’t tell the full story.
To truly understand how the model makes decisions, we need to see what it sees and heatmaps make this possible.

In these heatmaps, red indicates areas of high attention, highlighting the regions the model relies on most during prediction. By analyzing these attention maps, we can observe how each model interprets visual information, revealing fundamental differences in their reasoning styles.

Let’s walk through one representative case.

5.2.1 Frontal View of a Border Collie: From Local Eye Focus to Structured Morphological Understanding
When presented with a frontal image of a Border Collie, the three models reveal distinct attention patterns, reflecting how their architectural designs shape visual understanding.

The CNN-only model produces a heatmap with two sharp attention peaks, both centered on the dog’s eyes. This indicates a strong reliance on local features while overlooking other morphological traits like the ears or facial outline. While eyes are indeed important, focusing solely on them makes the model more vulnerable to variations in pose or lighting. The resulting confidence score of 0.5581 reflects this limitation.

With the CNN+Transformer model, the attention becomes more distributed. The heatmap forms a loose M-shaped pattern, extending beyond the eyes to include the forehead and the space between the eyes. This shift suggests that the model begins to understand spatial relationships between features, not just the features themselves. This added contextual awareness leads to a stronger confidence score of 0.6559.

The CNN+Transformer+MFE model shows the most structured and comprehensive attention map. The heat is symmetrically distributed across the eyes, ears, and the broader facial region. This indicates that the model has moved beyond feature detection and is now capturing how features are arranged as part of a meaningful whole. The Morphological Feature Extractor plays a key role here, helping the model grasp the structural signature of the breed. This deeper understanding boosts the confidence to 0.6972.

Together, these three heatmaps represent a clear progression in visual reasoning, from isolated feature detection, to inter-feature context, and finally to structural interpretation. Even though ConvNeXtV2 is already a powerful backbone, adding Transformer and MFE modules enables the model to not just see features but to understand them as part of a coherent morphological pattern. This shift is subtle but crucial, especially for fine-grained tasks like breed classification.

5.2.2 Error Case Analysis: From Misclassification to True Understanding

This is a case where the CNN-only model misclassified a Border Collie.

Looking at the heatmap, we can see why. The model focuses almost entirely on a single eye, ignoring most of the face. This kind of over-reliance on one local feature makes it easy to confuse breeds that share similar traits in this case, a Collie, which also has similar eye shape and color contrast.

What the model misses are the broader facial proportions and structural details that define a Border Collie. Its low confidence score of 0.2492 reflects that uncertainty.

With the CNN+Transformer model, attention shifts in a more promising direction. It now covers both eyes and parts of the forehead, creating a more balanced attention pattern. This suggests the model is beginning to connect multiple features, rather than depending on just one.

Thanks to self-attention, it can better interpret relationships between facial components, leading to the correct prediction — Border Collie. The confidence score rises to 0.5484, more than double the previous model’s.

The CNN+Transformer+MFE model takes this further by improving morphological awareness. The heatmap now extends to the nose and muzzle, capturing nuanced traits like facial length and mouth shape. These are subtle but important cues that help distinguish herding breeds from one another.

The MFE module seems to guide the model toward structural combinations, not just isolated features. As a result, confidence increases again to 0.5693, showing a more stable, breed-specific understanding.

This progression from a narrow focus on a single eye, to integrating facial traits, and finally to interpreting structural morphology, highlights how hybrid models support more accurate and generalizable visual reasoning.

In this example, the CNN-only model focuses almost entirely on one side of the dog’s face. The rest of the image is nearly ignored. This kind of narrow attention suggests the model didn’t have enough visual context to make a strong decision. It guessed correctly this time, but with a low confidence score of 0.2238, it’s clear that the prediction wasn’t based on solid reasoning.

The CNN+Transformer model shows a broader attention span, but it introduces a different issue, the heatmap becomes scattered. You can even spot a strong attention spike on the far right, completely unrelated to the dog. This kind of misplaced focus likely led to a misclassification as a Shiba Inu, and the confidence score was still low at 0.2305.

This highlights an important point:

Adding a Transformer doesn’t guarantee better judgment unless the model learns where to look. Without guidance, self-attention can amplify the wrong signals and create confusion rather than clarity.

With the CNN+Transformer+MFE model, the attention becomes more focused and structured. The model now looks at key regions like the eyes, nose, and chest, building a more meaningful understanding of the image. But even here, the confidence remains low at 0.1835, despite the correct prediction. This image clearly presented a real challenge for all three models.

That’s what makes this case so interesting.

It reminds us that a correct prediction doesn't always mean the model was confident. In harder scenarios (unusual poses, subtle features, cluttered backgrounds), even the most advanced models can hesitate.

And that’s where confidence scores become invaluable.
They help flag uncertain cases, making it easier to design review pipelines where human experts can step in and verify tricky predictions.

5.2.3 Recognizing Artistic Renderings: Testing the Limits of Generalization

Artistic images pose a unique challenge for visual recognition systems. Unlike standard photos with crisp textures and clear lighting, painted artworks are often abstract and distorted. This forces models to rely less on superficial cues and more on deeper, structural understanding. In that sense, they serve as a perfect stress test for generalization.

Let’s see how the three models handle this scenario.

Starting with the CNN-only model, the attention map is scattered, with focus diffused across both sides of the image. There’s no clear structure — just a vague attempt to “see everything,” which usually means the model is unsure what to focus on. That uncertainty is reflected in its confidence score of 0.5394, sitting in the lower-mid range. The model makes the correct guess, but it’s far from confident.

Next, the CNN+Transformer model shows a clear improvement. Its attention sharpens and clusters around more meaningful regions, particularly near the eyes and ears. Even with the stylized brushstrokes, the model seems to infer, “this could be an ear” or “that looks like the facial outline.” It’s starting to map anatomical cues, not just visual textures. The confidence score rises to 0.6977, suggesting a more structured understanding is taking shape.

Finally, we look at the CNN+Transformer+MFE hybrid model. This one locks in with precision. The heatmap centers tightly on the intersection of the eyes and nose — arguably the most distinctive and stable region for identifying a Border Collie, even in abstract form. It’s no longer guessing based on appearance. It’s reading the dog’s underlying structure.

This leap is largely thanks to the MFE, which helps the model focus on features that persist, even when style or detail varies. The result? A confidence score of 0.7457, the highest among all three.

This experiment makes something clear:

Hybrid models don’t just get better at recognition, they get better at reasoning.


They learn to look past visual noise and focus on what matters most: structure, proportion, and pattern. And that’s what makes them reliable, especially in the unpredictable, messy real world of images.

Conclusion

As deep learning evolves, we’ve moved from CNNs to Transformers—and now toward hybrid architectures that combine the best of both. This shift reflects a broader change in AI design philosophy: from seeking purity to embracing fusion.

Think of it like cooking. Great chefs don’t insist on one technique. They mix sautéing, boiling, and frying depending on the ingredient. Similarly, hybrid models combine different architectural “flavors” to suit the task at hand.

This fusion design offers several key benefits:

  • Complementary strengths: Like combining a microscope and a telescope, hybrid models capture both fine details and global context.
  • Structured understanding: Morphological feature extractors bring expert-level domain insights, allowing models not just to see, but to truly understand.
  • Dynamic adaptability: Future models might adjust internal attention patterns based on the image, emphasizing texture for spotted breeds, or structure for solid-colored ones.
  • Wider applicability: From medical imaging to biodiversity and art authentication, any task involving fine-grained visual distinctions can benefit from this approach.

This visual system, blending ConvNeXtV2, attention mechanisms, and morphological reasoning, proves that accuracy and intelligence don't come from any single architecture, but from the right combination of ideas.

Perhaps the future of AI won’t rely on one perfect design, but on learning to combine cognitive strategies just as the human brain does.

References & Data Source

Research References

Dataset Sources

  • Stanford Dogs Dataset (Kaggle Dataset)
    Originally sourced from Stanford Vision Lab – ImageNet Dogs. License: non-commercial research and educational use only. Citation: Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel Dataset for Fine-Grained Image Categorization. FGVC Workshop, CVPR, 2011.
  • Unsplash Images – Additional images of four breeds (Bichon Frise, Dachshund, Shiba Inu, Havanese) were sourced from Unsplash for dataset augmentation.


Thank you for reading. Through developing PawMatchAI, I’ve learned many valuable lessons about AI vision systems and feature recognition. If you have any perspectives or topics you’d like to discuss, I welcome the opportunity to exchange ideas. 🙌
📧 Email
💻 GitHub

Disclaimer

The methods and approaches described in this article are based on my personal research and experimental findings. While the Hybrid Architecture has demonstrated improvements in specific scenarios, its performance may vary depending on datasets, implementation details, and training conditions.

This article is intended for educational and informational purposes only. Readers should conduct independent evaluations and adapt the approach based on their specific use cases. No guarantees are made regarding its effectiveness across all applications.

The post The Art of Hybrid Architectures appeared first on Towards Data Science.

Testing the Power of Multimodal AI Systems in Reading and Interpreting Photographs, Maps, Charts and More https://towardsdatascience.com/testing-the-power-of-multimodal-ai-systems-in-reading-and-interpreting-photographs-maps-charts-and-more/ Tue, 25 Mar 2025 18:30:49 +0000 https://towardsdatascience.com/?p=605249 Can multimodal AI systems consisting in LLMs with vision capabilities understand figures and extract information from them?

The post Testing the Power of Multimodal AI Systems in Reading and Interpreting Photographs, Maps, Charts and More appeared first on Towards Data Science.

Introduction

It’s no news that artificial intelligence has made huge strides in recent years, particularly with the advent of multimodal models that can process and create both text and images, and some very new ones that also process and produce audio and video. I think that these Ai Systems have the potential to revolutionize data analysis, robotics, and even everyday tasks like navigation and information extraction from visuals. Along these lines, I recently posed myself the following question:

Can multimodal AI systems consisting of large language models with vision capabilities understand figures that contain information, then process it and produce summaries, explanations, object identification, etc.?

Although this is in itself a question that would require a whole research project to be properly addressed, I was extremely curious, so I needed at least a rough answer. Therefore, I carried out some tests on how much OpenAI's vision-enhanced models understand about photographs, screenshots and other images containing plots, charts, shots from a driver's position or mimicking a robot's onboard camera, and even molecular structures.

More specifically, I tried the GPT-4o and GPT-4o-mini models developed by OpenAI, through ChatGPT or through their API. To my amazement, I found that these models can indeed understand quite a bit of what they see!

Let me present here the most interesting results I obtained, either directly through ChatGPT or programmatically, including code for a web app where you can paste a picture and have GPT-4o-mini analyze it.

First of All: How to Analyze Images With OpenAI’s GPT-4o Models, Via ChatGPT or Programmatically

I carried out my first tests right away on ChatGPT's free version using GPT-4o, which allows up to 4 chances per day to use the "vision" capabilities for free. To use this you must upload or paste from the clipboard a picture that will be sent together with the prompt. On a paid ChatGPT account you can do this more times, and with the OpenAI API you can do this for as long as token credits are available, directly and programmatically, as I will cover here.

Processing prompts containing pictures programmatically

But how do you process images with GPT-4o models programmatically? Well, it isn’t that complex, as explained in OpenAI’s API reference.

In JavaScript, which I love because it lets me easily write programs that run out of the box online (as you already know if you follow me), you just need a function call that looks like this (a Python equivalent follows right after):

async function sendToOpenAI(imageData) {
    const apiKey = "...API KEY HERE";
    // imageData is a data URL ("data:image/jpeg;base64,..."); keep only the base64 payload
    const base64Image = imageData.split(",")[1];
    
    // Send the text prompt and the image together to the chat completions endpoint
    const response = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            "Authorization": `Bearer ${apiKey}`
        },
        body: JSON.stringify({
            model: "gpt-4o-mini",
            messages: [{
                role: "user",
                content: [
                    { type: "text", text: "what's in this image?" },
                    { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64Image}` } }
                ]
            }]
        })
    });
    
    // Show the model's answer on the page
    const result = await response.json();
    document.getElementById("response").innerText = result.choices[0].message.content;
}
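For readers who prefer Python, here is a minimal equivalent of the same request. It mirrors the JavaScript above (same endpoint and JSON body); the file name, the use of the requests library, and reading the key from an environment variable are my own choices for this sketch, not part of the original app:

import base64
import os
import requests

# Assumed setup: a local JPEG and an API key stored in the OPENAI_API_KEY environment variable
api_key = os.environ["OPENAI_API_KEY"]
with open("example.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# Same endpoint and message structure as the JavaScript call above
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "what's in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
            ],
        }],
    },
)
print(response.json()["choices"][0]["message"]["content"])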

A web app to quickly run image-text prompts

In this example web app you can provide an API key and a text prompt, and then paste an image from the clipboard that will be immediately sent to OpenAI for analysis together with the prompt.

https://lucianoabriata.altervista.org/tests/gpt-vision/gpt-4o-mini-vision.html

The web app ends up displaying the answer, as in this example:

This and all other images, photographs and screenshots shown in this article are by the author.

(Keep reading to find out, towards the end of the post, how Google’s Gemini 2.0 interpreted this screenshot; its reply was just mind-blowingly correct, as it captured the artist’s actual intention!)

Challenging GPT-4o Models With Varied Examples of Image-Text Prompts

Using the above app to run GPT-4o-mini and ChatGPT’s free version to run GPT-4o, I evaluated how well these models understand maps, photographs that mimic what a driver sees on the road, camera views showing the position of a robot’s arm relative to some target objects, various kinds of plots and charts presenting data, and software screenshots, including some that display molecules.

I am stunned at how much the models understand! Read on to discover it all, take a look at my first tests with the impressive Gemini 2.0 Flash from Google, and then perhaps try your favorite AI models on your own ideas, sharing what you find in the comments!

1. Vision-Based Analysis of a Google Maps Screenshot

Sometimes maps contain so much information that you can easily get lost. Could a multimodal system help in such a situation? Let’s find out!

Here on the left is a screenshot from a Google Maps view that I fed into GPT-4o-mini via the API together with the question “What’s in this image?”. You can see the answer on the right:

2. Interpreting Driving Signs on the Road

Could multimodal systems perhaps help you drive? Of course this deserves a full investigation in order to draw a solid conclusion, but let’s see what happens in some quick tests. In each test, I provided a photograph of what a driver sees on the road, together with this text prompt:

Interpret any road signs or things I should pay attention to as a driver

Here are the results on some example shots, all run with the web app provided above, i.e. using GPT-4o-mini:

First, for this photograph:

GPT-4o-mini said the following:

As you drive, pay attention to the following signs and features in the image:

1. **”Keep Right” Sign**: This indicates that you should stay to the right side of the road, likely because of an island or obstruction ahead.

2. **Curved Road Ahead**: The layout of the road suggests you may encounter a curve, which means you should reduce your speed and stay alert for any oncoming vehicles.

3. **Speed Limit**: The sign showing a speed limit of 40 suggests that you should not exceed this speed.

4. **Road Layout**: Be mindful of the marking lines and how they direct traffic. The two solid yellow lines indicate that crossing them (to pass another vehicle) is prohibited.

5. **Surroundings**: Note the vegetation and landscape along the road, which may obscure visibility in some areas. Maintain awareness of pedestrians or cyclists.

Make sure to remain vigilant and adhere to these guidelines for safe driving.

This answer was quite nice, complete, and useful!

Let’s try the next one, taken in a heavily snow-covered place:

Here it clearly picked up the snow, and correctly inferred it is a region with mountains, offering some quite general advice that suits the situation but doesn’t seem very specific to it:

When driving in snowy mountain conditions, there are several important things to pay attention to:

1. **Yield Sign**: If there’s a yield sign, be prepared to slow down and give way to any traffic on the intersecting road.

2. **Road Conditions**: Snow can make roads slick and reduce visibility. Drive cautiously and maintain a safe distance from other vehicles.

3. **Speed Limit**: Always adhere to the posted speed limit but consider reducing your speed further due to the weather conditions.

4. **Slippery Surfaces**: Be aware that certain areas may be icier or more treacherous. Watch for signs indicating slippery roads or do not enter areas.

5. **Falling Rocks**: In mountainous areas, watch for any signs indicating potential rockfall or land slides, which may be exacerbated by snow.

6. **Wildlife**: In rural or mountainous regions, be alert for wildlife crossing the road, especially in winter months.

7. **Visibility**: If visibility is low due to snow, be sure to use your headlights and reduce your speed.

Always ensure that your vehicle is equipped for winter conditions, including snow tires or chains if necessary.

Let’s try one last photograph, which features a tunnel, a car right ahead of the driver, some warnings on the left, and two sign posts on the right:

Here’s the answer, which seems to be warning about all general and situation-specific issues:

Here are some important points to note based on the image:

1. **Tunnel Ahead**: You are approaching a tunnel. Prepare for possible changes in visibility and ensure your vehicle’s headlights are on.

2. **Speed and Space**: Always maintain a safe distance from the vehicle in front of you, especially in confined spaces like tunnels.

3. **No Pedestrians or Bicycles**: The sign indicates that pedestrians and bicycles are not allowed in the area. Stay alert to ensure you don’t encounter anyone on the road.

4. **Road Barriers and Traffic Flow**: Pay attention to the road markings and barriers as they guide lane usage and safe navigation near the tunnel entrance.

5. **Weather and Terrain**: The surrounding scenery suggests that weather conditions may change quickly, especially driving in mountain areas. Be prepared for changes in road conditions.

Always drive cautiously and obey any traffic signs or signals present.

Note, however, that it didn’t mention the specific change in speed limit. I tried this photograph again with more focused text prompts, but it never said anything about the change in speed limit specific to this situation, instead providing general recommendations such as this:

The maximum speed limit inside a tunnel can vary based on the country’s traffic regulations, the specific road, and the tunnel’s design. Generally, in many places, typical speed limits range from 50 to 80 km/h (about 30 to 50 mph). To determine the exact speed limit for a specific tunnel, you would need to look for posted signs or consult local traffic regulations.

This is correct, but it shows that the model wasn’t capable of extracting the information specifically available here, which would be something along the lines of “Maximum speed is not indicated, but it is definitely above 50 according to the visible sign post.”

3. Could Vision-Ready Models Drive Robots?

If an AI model understands pictures well enough, one could then think about feeding a robot’s camera feed to the model so that it can produce outputs that make sense. Of course, this is a huge topic to study in itself, since the outputs must eventually become mechanical commands; but let’s see here how the core part, converting views to broad actions, would work.

Given this photograph of a robotic arm on a table where a red and a blue ball wait to be grabbed:

I ask GPT-4o the following:

How should the robotic arm move in order to grab the blue sphere?

To which the LLM replies quite correctly, even asking me if I want it to follow up with code generation:

To grab the blue sphere, the robotic arm should follow these steps:

Position the Base: Rotate the base so that the arm aligns with the blue sphere.

Lower the Arm: Adjust the arm’s height by moving its joints to bring the gripper closer to the sphere.

Extend Forward: If necessary, extend the arm forward to reach the blue sphere.

Open the Gripper: Ensure the gripper is open wide enough to grasp the sphere.

Close the Gripper: Gently close the gripper around the sphere to secure it.

Lift the Sphere: Raise the arm slightly while keeping the sphere firmly held.

Retract or Move to Destination: Move the arm to a desired location for placing the sphere.

These movements should be smooth to avoid dropping or misaligning with the sphere. Would you like assistance with programming the robotic arm for this task?

I feel the answers are quite good, but they lack precision. So, rather than going down the road of asking for code, I will interrogate the LLM a bit more deeply to see what it really understands in this picture:

When you say “Position the Base: Rotate the base so that the arm aligns with the blue sphere.” should the arm move to its left or to its right, based on the image?

The answer turns out very good:

Based on the image, the robotic arm should rotate to its left to align with the blue sphere. The sphere is positioned to the left of the arm’s current orientation, so rotating counterclockwise will bring the gripper closer to the sphere.

Yet I dig a bit deeper:

How much do you estimate it should rotate?

This new answer is essentially the same that I could estimate as a human from the photograph:

Estimating from the image, the robotic arm should rotate its base approximately 30 to 45 degrees counterclockwise to align with the blue sphere.

This estimate assumes that:

The arm’s current position is facing slightly to the right.

The blue sphere is placed in front and to the left of the arm.

Once the base is rotated, minor adjustments to the arm joints may be needed to fine-tune the position before grasping the sphere.

Finally, I ask about the red ball, to test if the model understands that it is on the other side and a smaller rotation is required:

And how about grabbing the red ball?

And it does get it right!

To grab the red sphere, the robotic arm should follow these steps:

Rotate the Base: The base should rotate slightly to the left, but much less than for the blue sphere. An estimated 15 to 20 degrees counterclockwise should align the arm with the red sphere.

Lower the Arm: Bring the gripper down closer to the red sphere.

Extend Forward (if needed): Adjust the arm’s reach if the gripper is not directly above the sphere.

Open the Gripper: Ensure the gripper is wide enough to grab the sphere.

Close the Gripper: Securely grasp the red sphere.

Lift the Arm: Raise the sphere carefully.

Move to Destination: Transport the sphere to a desired location if needed.

Since the red sphere is closer to the robotic arm than the blue sphere, less movement is required. Would you like a precise angle calculation based on measurements?

If you’re also excited about the possibility of driving robots with vision models, check out this thread on X.com.

4. Vision Models Acting on Data Plots

Since this is Towards Data Science, we can’t escape the obvious question: can vision-assisted models understand the content of plots, and perhaps even draw conclusions from them?

I ran quite a few tests here, dealing with different kinds of information and plot types.

Detecting high tides in a tide chart

I began with the following chart of tide highs and lows for one of the Azores islands, written in Portuguese, though I asked my question in English:

When are the peak highs in the tides according to the chart?

This was the answer:

The peak high tides in the chart are marked by blue dots with corresponding times. The highest tides occur at the following times:

11 Aug: 11:18
12 Aug: 12:15
13 Aug: 12:58
14 Aug: 13:33
15 Aug: 14:05
16 Aug: 14:35
17 Aug: 15:05

These times represent when the tide reaches its highest point each day.

You can immediately see that the displayed data is correct, but it only represents high tides during the day and totally overlooks the high tides that happen during the night! Hence, I asked for that specifically, and got a much more complete answer:

I said:

But the times you say are all during the days… doesn’t the chart show also peak high tides at night?

GPT-4o said:

You’re absolutely right! The chart also shows peak high tides occurring at night. Here are all the peak high tide times (both daytime and nighttime) according to the chart:

11 Aug: 11:18 (Day)
12 Aug: 12:15 (Day) & 23:49 (Night)
13 Aug: 12:58 (Day) & 00:41 (Night)
14 Aug: 13:33 (Day) & 01:20 (Night)
15 Aug: 14:05 (Day) & 01:54 (Night)
16 Aug: 14:35 (Day) & 02:24 (Night)
17 Aug: 15:05 (Day) & 02:53 (Night)

Each day has two peak high tides, one during the day and one at night. Thanks for catching that!

That’s now perfect. Finally, I asked for one more piece of information:

OK thanks. Can you say where on earth this chart is for?

And again the answer was right:

Yes! The chart is for Ponta Delgada, São Miguel Island, which is part of the Azores, an autonomous region of Portugal in the North Atlantic Ocean. This location is known for its stunning volcanic landscapes, coastal views, and maritime activities. The tide chart shows the tidal schedule for this area from August 11 to August 17.

Interpreting a height profile from a photo

Next I made the question a bit harder: the image is a photograph of a screen (note the interfering horizontal bands), the x-axis of the plot is incomplete, and the question involves some thinking:

Looking at this, would I go up or down from Andermatt to Wassen? By how many meters and over how many km?

The answer, however, was perfect! Here it is:

From the elevation profile, Andermatt is at 1437m, and Wassen is at 916m. This means that traveling from Andermatt to Wassen involves a descent of 521 meters (1437m — 916m).

Looking at the horizontal axis, the distance from Andermatt to Wassen appears to be around 10 km. So, you would be going downhill by 521 meters over approximately 10 km.

Interpreting plots from RNA-seq data

I then tested GPT-4o’s capability to interpret plots produced as part of a bioinformatics pipeline. First, I made it process a volcano plot, which is one of the most common kinds of graphs used to analyze differential gene expression data. I took an example image from this free online tutorial, and asked the system the following:

You are a bioinformatics expert. What would you conclude from this volcano plot? I mean, which genes are up- or down-regulated?

This was GPT-4o’s answer:

This volcano plot represents differentially expressed genes in Arabidopsis thaliana infected with Pseudomonas syringae DC3000 compared to a mock treatment. Here’s my interpretation:

1. X-axis (Log₂ fold change):
Positive values indicate up-regulated genes (higher expression in infected samples).
Negative values indicate down-regulated genes (lower expression in infected samples).

2. Y-axis (-Log₁₀ P-value):
Higher values indicate stronger statistical significance.

3. Color coding:
Gray: Not significant.
Blue: Significant by p-value but not by fold change.
Red: Significant by both p-value and fold change threshold.

Key findings:

Up-regulated genes (right side, red dots):
Strongly induced genes include AT5G42800, AT1G52440, AT1G73325, AT4G18440, among others.
These genes are highly expressed in infected plants.

Down-regulated genes (left side, red dots):
Examples include AT4G28940, AT5G04100, AT1G04220, AT1G02440.
These genes are significantly suppressed during infection.

Biological Implications:

Up-regulated genes may be involved in plant immune responses, such as defense signaling, stress response, or antimicrobial activity.

Down-regulated genes might be involved in normal cellular functions that get suppressed during infection.

You can see how the first part of the response clearly reads out the main features of the plot, explaining what the axes and label colors are, and even considering the plot’s title. Then, the model distills lists of genes that are up- or down-regulated, passing the thresholds with strong statistical significance. Note that the list is correct for the up-regulated genes but has a few problems with the down-regulated ones. On inspection of the plot, it becomes clear that this is due to the labels not being clearly resolved, even for humans.

Going one step further, the model somehow knows correct information about the identified genes, so it can draw some simple conclusions about the biological significance of the results.

Truly amazing, were it not for that small problem reading the labels, which in a real scenario could probably be improved.


Then in a second test related to RNA-seq data, I asked GPT-4o what it thought about two distributions of p-values shown together in a single graph (image edited from this tutorial):

This figure shows two distributions of p-values from RNA-seq data. Do you think they are OK, or do you notice anything weird?

This was the answer, again excellent:

This histogram represents two distributions of p-values from RNA-seq data, likely comparing two different sets of statistical tests. Here’s my assessment:

* What looks normal?

Uniform distribution of non-significant p-values:

The light blue bars seem to follow a relatively uniform distribution across the range, which is expected under the null hypothesis (genes with no real differential expression should have p-values uniformly distributed between 0 and 1).

Enrichment of low p-values:

The peak near p = 0 suggests that a subset of genes show strong statistical significance, which is expected when there are truly differentially expressed genes.

* What looks weird?

Excessive spike at p = 0:

If there is an unnaturally large spike at exactly 0, this could indicate a bias or an issue in the statistical testing (e.g., overfitting, batch effects, or miscalculated p-values).

Bimodal distribution:

The yellow distribution appears to have multiple peaks, which is unusual. Normally, p-value distributions should either be uniform (under the null hypothesis) or show a single peak at low p-values (for differentially expressed genes).

Peaks at intermediate p-values might suggest issues with normalization, batch effects, or improper multiple-testing corrections.

* Potential issues to check

Overdispersion or poor model fit: Are the statistical methods used properly accounting for biological variability?

Multiple testing correction issues: If false discovery rate (FDR) correction wasn’t applied correctly, the distribution could be skewed.

Batch effects or confounding variables: If the data contains hidden covariates, it can introduce unexpected structure in the p-value distribution.

5. Analysis of Screenshots Containing Molecular Structures

If you follow me, you know that I work on various molecular sciences; so, of course, I had to try out these vision-enhanced models with that kind of input.

Help using a program to generate 3D structures from 2D drawings

First I checked GPT-4o-mini’s capability to help a user who wants to create a molecule in 3D format from a drawing and who knows that this should be possible with Hack-A-Mol but doesn’t know exactly how to use it.

By the way, I covered Hack-A-Mol here:

https://medium.com/advances-in-biological-science/purely-web-based-suites-for-manipulation-of-bio-molecular-data-cbb7fdd60df4

Now let’s see how well GPT-4o can help one to interpret what one obtains in this program. I uploaded this screenshot:

And asked this question:

I’m trying to build a small molecule using this tool, by drawing and then clicking the arrow to the right…. but I don’t get very well what’s being generated exactly… can you explain?

This was the answer:

Yes! When you draw your small molecule and click the arrow, several things are generated:

2D Structure (Left Panel)

This is your drawn molecule in 2D, using standard chemical notation.

3D Structure (Right Panel — JSmol Viewer)

This is the 3D molecular model generated from your 2D drawing.

It gives you a spatial representation of the molecule.

Molecular Identifiers (Below the 2D Editor)

InChI (IUPAC International Chemical Identifier): A text-based representation of the molecular structure.

InChIKey: A hashed, fixed-length version of the InChI, useful for searching databases.

SMILES (Simplified Molecular Input Line Entry System): A compact string representation of the molecule.

Structural Data (Bottom Text Box)

You see data in MOL/SDF format, which contains atomic coordinates, bonding information, and metadata about the molecule.

Other formats available include XYZ, PDB, and CIF, which are useful for computational chemistry and molecular modeling.

What Can You Do with This?

If you want to simulate the molecule or use it in software like AlphaFold 3, you should export the structure in SDF or PDB format.

If you want to search for the molecule online, use the InChIKey or SMILES.

Using GPT-4o with vision to help understand AlphaFold models

In the next example, I tested how well GPT-4o can understand structural models of proteins produced by AlphaFold (in this case AlphaFold 2 as utilized to build the AlphaFold-EBI database of models). By the way, if you have no idea what AlphaFold is, check out these posts:

https://medium.com/advances-in-biological-science/alphafold-3s-advances-and-limitations-in-a-nutshell-5ae0cdd814ea

https://lucianosphere.medium.com/guide-to-my-blog-articles-on-alphafold-39c10a7dad7f

It is well-known that the AlphaFold 2 and 3 systems revolutionized modeling; however, of course, they aren’t infallible. To help users understand what’s likely correct and wrong in a structural model, AlphaFold returns not just the atomic coordinates that constitute the structural model but also quality metrics, of which the most important are the residue-wise pLDDTs and the inter-residue PAEs. Here I tested GPT-4o with vision capabilities on this Uniprot entry precomputed at the AF-EBI database:

The text part of the prompt was:

You are an expert in bioinformatics. I’m checking an entry at the AF-EBI database of protein structure models, and I got what I’m showing in the attached screenshot. What do you think? Is the model good? I’m especially interested in the relative orientation of the two domains, but I’m not sure what to conclude from the PAE plot…

And this was the answer:

From your screenshot, we can analyze the quality of the model based on the pLDDT confidence scores and the Predicted Aligned Error (PAE) plot:

Model Confidence (pLDDT)

The Mol* 3D viewer colors the model based on confidence scores.

Most of the structure is colored in blue (high to very high confidence, pLDDT > 70–90), suggesting that the individual domains are well-predicted.

However, some loops and termini (orange/red, pLDDT < 50–70) have lower confidence, which is expected for flexible regions.

PAE Plot Analysis (Domain Orientation)

The PAE plot (right panel) indicates how well the relative positions of different residues are predicted.

low PAE (dark green) means the relative position between two residues is well-defined.

high PAE (lighter green/white areas) suggests uncertainty in the relative positioning.

Interpreting the PAE

The diagonal (self-alignment) is always well-defined.

The block-like structure in the PAE plot suggests two or more well-defined domains with high internal confidence.

However, lighter green areas between the blocks indicate uncertainty in the relative orientation of these domains.

If your interest is in the relative orientation of the domains, this suggests the model may not be reliable in that aspect — it might have multiple plausible conformations.

Conclusion

The individual domains are well-modeled, but their relative orientation might be uncertain.

If functionally relevant, you might consider experimental validation (e.g., cryo-EM, SAXS, or NMR) or molecular dynamics simulations to explore domain flexibility.

If you need a more rigid structure, you could look for homologous experimental structures or use AlphaFold-Multimer if interactions between chains matter.

The answer was pretty neat and correct. However, I realized that I might have confused the model by talking about two domains, when the screenshot actually shows three. I then asked a follow-up question:

Now checking carefully I see the TED Domains thing shows 3 domains… am I right that according to the PAE plot the arrangement of the first two domains is confident, but not so much their arrangement relative to the third one?

The answer, now based on an inspection of the PAE plot under the premise that there are indeed 3 domains (the first two confidently modeled relative to each other and the third one only modeled confidently at the local level), is again perfect:

Yes, your interpretation is correct! The TED Domains annotation indicates that the model has three domains, as shown by the color-coded segments at the top of the PAE plot.

Looking at the PAE plot:

The first two domains (green and orange segments) have low PAE values between them (darker green in the top-left block of the matrix). This suggests that their relative orientation is well-defined and confident.

However, the third domain (purple segment) shows higher PAE values relative to the first two (lighter green/white regions in the lower-right part of the matrix). This means that the relative position of the third domain is uncertain — it may adopt multiple conformations.

What This Means for Your Analysis

The first two domains likely form a rigid unit with a stable interaction.

The third domain may be more flexible relative to the first two, possibly moving independently or being less constrained in solution.

If this flexibility is functionally important (e.g., for ligand binding, protein-protein interactions, or conformational switching), experimental validation like SAXS, cryo-EM, or FRET could be useful.

Visual analysis of protein-ligand complexes

Last, I geared up to challenge GPT-4o with a more difficult problem. Given a 2D snapshot from a molecular graphics program showing a ligand bound to a protein, can it infer what kinds of contacts the two molecules establish?

The question was:

Here’s a screenshot of ibuprofen docked inside a pocket in BSA. What interactions do you see between the ligand and the protein? I want you to tell me exactly what interactions you see in this structure from the screenshot provided.

And this was the answer:

From the screenshot you provided, I can visually infer potential interactions, but without residue labels or a precise atomic distance measurement, I can only make educated observations. Here’s what I specifically see in your image:

1. Hydrophobic Interactions:

Ibuprofen (highlighted in yellow) is embedded in a hydrophobic pocket of the protein.

It appears to be surrounded by gray carbon atoms from amino acid side chains, likely hydrophobic residues (e.g., leucine, valine, phenylalanine, or isoleucine).

The benzene ring of ibuprofen seems to be in close contact with other aromatic or aliphatic side chains, suggesting π-π stacking or van der Waals interactions.

2. Hydrogen Bonds:

The carboxyl (-COOH) group of ibuprofen appears to be interacting with nearby blue (nitrogen) or red (oxygen) atoms from the protein.

There is at least one close contact between the oxygen of the carboxyl group and a protein residue, likely forming a hydrogen bond with a backbone amide or side chain hydroxyl (e.g., Ser, Thr, Tyr, Asp, or Glu).

3. Possible Electrostatic Interactions:

Ibuprofen is negatively charged at physiological pH (carboxylate anion, -COO⁻).

There may be a salt bridge or ion-dipole interaction if a positively charged residue (e.g., Arg, Lys, or His) is nearby, but I cannot confirm the exact residue from the image alone.

What I Cannot Determine from the Screenshot Alone:

Precise interacting residues: No residue labels are visible.

Exact interaction distances: Hydrogen bonds and electrostatic interactions typically require precise atomic distances (~2–3 Å).

Interaction strengths: Molecular modeling software would be needed to confirm interaction energies.

The answer was quite good, especially considering that the picture is just a static 2D view and is quite crowded with atoms. In fact, it correctly pinpoints that the ibuprofen is shown in yellow, and highlights hydrophobic and aromatic contacts around its aromatic ring, both of which do happen (#1 in the figure below), and hydrogen bonds from its carboxylate group to nearby N and O atoms (#2 in the figure), adding that there might also exist a salt bridge between that same carboxylate and some positively charged group of the protein that it can’t distinguish clearly (#3).

For your reference, I show these interactions in the next figure, numbered 1, 2 and 3 as they were mentioned:


Other Vision-Capable Models? Peeking Into Google’s Gemini 2.0 Flash

OpenAI is probably the most famous developer of LLM-based AI systems, but it is of course not the only one, and its models aren’t even the best for some tasks according to some evaluations. However, they remain, to me at least, the easiest to use both directly in a GUI (ChatGPT) and programmatically (as we have seen here and I’ve shown in many other posts).

But my point here is that we can’t overlook the other models. So, just while preparing this post, I decided to try Google AI Studio’s Gemini 2.0 Flash, which, like GPT-4o, can process pictures, audio and documents, and even video!

Again I was amazed, and in fact it seemed to perform slightly better than the GPT-4o and GPT-4o-mini models tried throughout this post. I was particularly stunned by this specific example, where I loaded the screenshot of my app shown above, to which Gemini replied (without any prompt):

Here’s what I can infer based on the image you provided:

Prompt: “What do you think the artist”

Analysis:

The response provided at the bottom of the image states “The artist likely intended to convey themes of nature, beauty, and vibrancy through the striking imagery of the red poppies against a lush green background.”

Essentially, the AI is recognizing the subject matter and interpreting the artist’s intent as showcasing the beauty and vibrancy of nature.

That was kind of similar to what GPT-4o replied, probably because it just read what GPT-4o itself wrote. But what if I ask it a question trying to address the same point I asked GPT-4o?

Here it goes:

do your own interpretation of the picture… what do you think did the author intend to communicate?

And this was Gemini’s reply, now perfectly reflecting what I think the author intended to convey:

Okay, based on the image, here’s my interpretation of what the author might have intended:

Beyond the surface-level beauty, the image hints at a more complex exploration of color harmony and contrast within nature.

The juxtaposition of the natural poppy field with the color wheel suggests a deliberate artistic statement. The artist may be inviting us to analyze the colors present in the natural scene (dominated by greens and reds) and to understand how these colors interact and create a specific visual experience.

The placement of the color wheel with the arrows pointing to the green and red suggest complementary colours.

Perhaps the author wants us to appreciate the inherent color palettes nature provides.

In short, while the surface reading is about beauty, the integration of the color wheel suggests a deeper interest in the science and theory of color as applied to the natural world. The author seems to be inviting a more analytical and informed way of appreciating the beauty we see.

The key part of Gemini’s reply, which GPT-4o didn’t get in any of my tests, is this:

“The placement of the color wheel with the arrows pointing to the green and red suggest complementary colours.”

That was exactly my point as an artist!

If this post is well received, I will test Gemini 2.0’s various capabilities in more detail. So let me know in the comments what you think!

Conclusions

I guess that, after reading all these examples, you too appreciate the potential of multimodal, vision-capable AI systems. The tests I presented with the GPT-4o models (and the glimpse into Gemini 2.0) demonstrate the ability of the most modern AI systems to interpret quite complex images, understanding and extracting information from photographs, scientific figures of various kinds, and plots and charts, even when they contain information as complex as molecular structures. From these examples we can envision applications as varied as assisting data analysis, assisting driving, and controlling robots, all of them provided some problems are overcome.

This is because some images and questions are, of course, quite challenging. In some cases, for example when limited by the resolution of labels in dense plots, by the lack of 3D perspective in flat screenshots of molecular structures, or by having to estimate rotation angles for the robotic arm, there’s probably not much to be done, and the model remains as limited as even the most expert humans in the subject would be.

Yes, it is overall clear that with proper prompting and continued advancements in AI, these models could become invaluable tools for accelerating data interpretation and decision-making. They could reduce the load on human experts, who could then dedicate themselves to more complex problems while the software assists non-experts in interpreting graphical outputs, and who knows, maybe some day they will drive cars and control robots!


www.lucianoabriata.com I write about everything that lies in my broad sphere of interests: nature, science, technology, programming, etc. Subscribe to get my new stories by email. To consult about small jobs check my services page here. You can contact me here. You can tip me here.

The post Testing the Power of Multimodal AI Systems in Reading and Interpreting Photographs, Maps, Charts and More appeared first on Towards Data Science.

]]>
From Fuzzy to Precise: How a Morphological Feature Extractor Enhances AI’s Recognition Capabilities https://towardsdatascience.com/from-fuzzy-to-precise-how-a-morphological-feature-extractor-enhances-ais-recognition-capabilities-2/ Tue, 25 Mar 2025 05:34:38 +0000 https://towardsdatascience.com/?p=605220 Mimicking human visual perception to truly understand objects

The post From Fuzzy to Precise: How a Morphological Feature Extractor Enhances AI’s Recognition Capabilities appeared first on Towards Data Science.

]]>
Introduction: Can AI really distinguish dog breeds like human experts?

One day while taking a walk, I saw a fluffy white puppy and wondered, Is that a Bichon Frise or a Maltese? No matter how closely I looked, they seemed almost identical. Huskies and Alaskan Malamutes, Shiba Inus and Akitas, I always found myself second-guessing. How do professional veterinarians and researchers spot the differences at a glance? What are they focusing on? 🤔

This question kept coming back to me while developing PawMatchAI. One day, while struggling to improve my model’s accuracy, I realized that when I recognize objects, I don’t process all details at once. Instead, I first notice the overall shape, then refine my focus on specific features. Could this “coarse-to-fine” processing be the key to how experts identify similar dog breeds so accurately?

Digging into research, I came across a cognitive science paper confirming that human visual recognition relies on multi-level feature analysis. Experts don’t just memorize images; they analyze structured traits such as:

  • Overall body proportions (large vs. small dogs, square vs. elongated body shapes)
  • Head features (ear shape, muzzle length, eye spacing)
  • Fur texture and distribution (soft vs. curly vs. smooth, double vs. single coat)
  • Color and pattern (specific markings, pigment distribution)
  • Behavioral and postural features (tail posture, walking style)

This made me rethink traditional CNNs (Convolutional Neural Networks). While they are incredibly powerful at learning local features, they don’t explicitly separate key characteristics the way human experts do. Instead, these features are entangled within millions of parameters without clear interpretability.

So I designed the Morphological Feature Extractor, an approach that helps AI analyze breeds in structured layers—just like how experts do. This architecture specifically focuses on body proportions, head shape, fur texture, tail structure, and color patterns, making AI not just see objects, but understand them.

PawMatchAI is my personal project that can identify 124 dog breeds and provide breed comparisons and recommendations based on user preferences. If you’re interested, you can try it on HuggingFace Space or check out the complete code on GitHub: 

⚜ HuggingFace: PawMatchAI

⚜ GitHub: PawMatchAI

In this article, I’ll dive deeper into this biologically-inspired design and share how I turned simple everyday observations into a practical AI solution.


1. Human vision vs. machine vision: Two fundamentally different ways of perceiving the world

At first, I thought humans and AI recognized objects in a similar way. But after testing my model and looking into cognitive science, I realized something surprising: humans and AI actually process visual information in fundamentally different ways. This completely changed how I approached AI-based recognition.

🧠 Human vision: Structured and adaptive

The human visual system follows a highly structured yet flexible approach when recognizing objects:

1⃣ Seeing the big picture first → Our brain first scans the overall shape and size of an object. This is why, just by looking at a dog’s silhouette, we can quickly tell whether it’s a large or small breed. Personally, this is always my first instinct when spotting a dog.

2⃣ Focusing on key features → Next, our attention automatically shifts to the features that best differentiate one breed from another. While researching, I found that professional veterinarians often emphasize ear shape and muzzle length as primary indicators for breed identification. This made me realize how experts make quick decisions.

3⃣ Learning through experience → The more dogs we see, the more we refine our recognition process. Someone seeing a Samoyed for the first time might focus on its fluffy white fur, while an experienced dog enthusiast would immediately recognize its distinctive “Samoyed smile”, a unique upturned mouth shape.

🤖 How CNNs “see” the world

Convolutional Neural Networks (CNNs) follow a completely different recognition strategy:

  • A complex system that’s hard to interpret → CNNs do learn patterns from simple edges and textures to high-level features, but all of this happens inside millions of parameters, making it hard to understand what the model is really focusing on.
  • When AI confuses the background for the dog → One of the most frustrating problems I ran into was that my model kept misidentifying breeds based on their surroundings. For example, if a dog was in a snowy setting, it almost always guessed Siberian Husky, even if the breed was completely different.

2. Morphological Feature Extractor: Inspiration from cognitive science

2.1 Core design philosophy

Throughout the development of PawMatchAI, I’ve been trying to make the model identify similar-looking dog breeds as accurately as human experts can. However, my early attempts didn’t go as planned. At first, I thought training deeper CNNs with more parameters would improve performance. But no matter how powerful the model became, it still struggled with similar breeds, mistaking Bichon Frises for Maltese, or Huskies for Eskimo Dogs. That made me wonder: Can AI really understand these subtle differences just by getting bigger and deeper?

Then I thought back to something I had noticed before, when humans recognize objects, we don’t process everything at once. We start by looking at the overall shape, then gradually zoom in on the details. This got me thinking, what if CNNs could mimic human object recognition habits by starting with overall morphology and then focusing on detailed features? Would this improve recognition capabilities?

Based on this idea, I decided to stop simply making CNNs deeper and instead design a more structured model architecture, ultimately establishing three core design principles:

  1. Explicit morphological features: This made me reflect on my own question: What exactly are professionals looking at? It turns out that veterinarians and breed experts don’t just rely on instinct, they follow a clear set of criteria, focusing on specific traits. So instead of letting the model “guess” which parts matter, I designed it to learn directly from these expert-defined features, making its decision-making process closer to human cognition.
  2. Multi-scale parallel processing: This corresponds to my cognitive insight: humans don’t process visual information linearly but attend to features at different levels simultaneously. When we see a dog, we don’t need to complete our analysis of the overall outline before observing local details; rather, these processes happen concurrently. Therefore, I designed multiple parallel feature analyzers, each focusing on features at different scales, working together rather than sequentially.
  3. Why relationships between features matter more than individual traits: I came to realize that looking at individual features alone often isn’t enough to determine a breed. The recognition process isn’t just about identifying separate traits, it’s about how they interact. For example, a dog with short hair and pointed ears could be a Doberman, if it has a slender body. But if that same combination appears on a stocky, compact frame, it’s more likely a Boston Terrier. Clearly, the way features relate to one another is often the key to distinguishing breeds.

2.2 Technical implementation of the five morphological feature analyzers

Each analyzer uses different convolution kernel sizes and layers to address various features:

1⃣ Body proportion analyzer

# Using large convolution kernels (7x7) to capture overall body features
'body_proportion': nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=7, padding=3),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU()
)

Initially, I tried even larger kernels but found they focused too much on the background. I eventually used (7×7) kernels to capture overall morphological features, just like how canine experts first notice whether a dog is large, medium, or small, and whether its body shape is square or rectangular. For example, when identifying similar small white breeds (like Bichon Frise vs. Maltese), body proportions are often the initial distinguishing point.

2⃣ Head feature analyzer

# Medium-sized kernels (5x5) are suitable for analyzing head structure
'head_features': nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=5, padding=2),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU()
)

The head feature analyzer was the part I tested most extensively. The technical challenge was that the head contains multiple key identification points (ears, muzzle, eyes), but their relative positions are crucial for overall recognition. The final design using 5×5 convolution kernels allows the model to learn the relative positioning of these features while maintaining computational efficiency.

3⃣ Tail feature analyzer

# Also using medium kernels (5x5), combined with 3x3, to capture tail shape and posture
'tail_features': nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=5, padding=2),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU()
)

Tails typically occupy only a small portion of an image and come in many forms. Tail shape is a key identifying feature for certain breeds, such as the curled upward tail of Huskies and the back-curled tail of Samoyeds. The final solution uses a structure similar to the head analyzer but incorporates more data augmentation during training (like random cropping and rotation).

4⃣ Fur feature analyzer

# Small kernels (3x3) are better for capturing fur texture
'fur_features': nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU()
)

Fur texture and length are critical features for distinguishing visually similar breeds. When judging fur length, a larger receptive field is needed. Through experimentation, I found that stacking two 3×3 convolutional layers (which together cover an effective 5×5 receptive field) improved recognition accuracy.

5⃣ Color pattern analyzer

# Color feature analyzer: analyzing color distribution
'color_pattern': nn.Sequential(
    # First layer: capturing basic color distribution
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),

    # Second layer: analyzing color patterns and markings
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),

    # Third layer: integrating color information
    nn.Conv2d(128, 128, kernel_size=1),
    nn.BatchNorm2d(128),
    nn.ReLU()
)

The color pattern analyzer has a more complex design than other analyzers because of the difficulty in distinguishing between colors themselves and their distribution patterns. For example, German Shepherds and Rottweilers both have black and tan fur, but their distribution patterns differ. The three-layer design allows the model to first capture basic colors, then analyze distribution patterns, and finally integrate this information through 1×1 convolutions.


2.3 Feature interaction and integration mechanism: The key breakthrough

Having different analyzers for each feature is important, but making them interact with each other is the most crucial part:

# Feature attention mechanism: dynamically adjusting the importance of different features
self.feature_attention = nn.MultiheadAttention(
    embed_dim=128,
    num_heads=8,
    dropout=0.1,
    batch_first=True
)

# Feature relationship analyzer: analyzing connections between different morphological features
self.relation_analyzer = nn.Sequential(
    nn.Linear(128 * 5, 256),  # Combination of five morphological features
    nn.LayerNorm(256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.LayerNorm(128),
    nn.ReLU()
)

# Feature integrator: intelligently combining all features
self.feature_integrator = nn.Sequential(
    nn.Linear(128 * 6, in_features),  # Five original features + one relationship feature
    nn.LayerNorm(in_features),
    nn.ReLU()
)

The multi-head attention mechanism is vital for identifying the most representative features of each breed. For example, short-haired breeds rely more on body type and head features for identification, while long-haired breeds depend more on fur texture and color.
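To make this step concrete, here is a tiny, self-contained example of how such an attention layer re-weights five pooled morphological descriptors. The shapes are assumptions for illustration (five 128-dimensional vectors per image, obtained by pooling each analyzer’s output); only the attention module’s configuration comes from the code above:

import torch
import torch.nn as nn

# Assumed: a batch of 2 images, each summarized by five 128-dim morphological descriptors
features = torch.randn(2, 5, 128)

attention = nn.MultiheadAttention(embed_dim=128, num_heads=8, dropout=0.1, batch_first=True)
attended, weights = attention(features, features, features)

print(attended.shape)  # torch.Size([2, 5, 128]) - re-weighted descriptors
print(weights.shape)   # torch.Size([2, 5, 5])   - how strongly each descriptor attends to the others

The 5×5 weight matrix is what lets the model emphasize, say, body type and head features for short-haired breeds while leaning on fur texture and color for long-haired ones.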


2.4 Feature Relationship Analyzer: Why feature relationships are so important

After weeks of frustration, I finally realized my model was missing a crucial element – when we humans identify something, we don’t just recall individual details. Our brains connect the dots, linking features to form a complete image. The relationships between features are just as important as the features themselves. A small dog with pointed ears and fluffy fur is likely a Pomeranian, but the same features on a large dog might indicate a Samoyed.

So I built the Feature Relationship Analyzer to embody this concept. Instead of processing each feature separately, I connected all five morphological features before passing them to the connecting layer. This lets the model learn relationships between features, helping it distinguish breeds that look almost identical at first glance, especially in four key aspects:

  1. Body and head coordination → Shepherd breeds typically have wolf-like heads paired with slender bodies, while bulldog breeds have broad heads with muscular, stocky builds. The model learns these associations rather than processing head and body shapes separately.
  2. Fur and color joint distribution → Certain breeds have specific fur types often accompanied by unique colors. For example, Border Collies tend to have black and white bicolor fur, while Golden Retrievers typically have long golden fur. Recognizing these co-occurring features improves accuracy.
  3. Head and tail paired features → Pointed ears and curled tails are common in northern sled dog breeds (like Samoyeds and Huskies), while drooping ears and straight tails are more typical of hound and spaniel breeds.
  4. Body, fur, and color three-dimensional feature space → Some combinations are strong indicators of specific breeds. Large build, short hair, and black-and-tan coloration almost always point to a German Shepherd.

By focusing on how features interact rather than processing them separately, the Feature Relationship Analyzer bridges the gap between human intuition and AI-based recognition.


2.5 Residual connection: Keeping original information intact

At the end of the forward propagation function, there’s a key residual connection:

# Final integration with residual connection
integrated_features = self.feature_integrator(final_features)

return integrated_features + x  # Residual connection

This residual connection (+ x) serves a few important roles:

  • Preserving important details → Ensures that while focusing on morphological features, the model still retains key information from the original representation.
  • Helping deep models train better → In large architectures like ConvNeXtV2, residuals prevent gradients from vanishing, keeping learning stable.
  • Providing flexibility → If the original features are already useful, the model can “skip” certain transformations instead of forcing unnecessary changes.
  • Mimicking how the brain processes images → Just like our brains analyze objects and their locations at the same time, the model learns different perspectives in parallel.

In the model design, a similar concept was adopted, allowing different feature analyzers to operate simultaneously, each focusing on different morphological features (like body type, fur, ear shape, etc.). Through residual connections, these different information channels can complement each other, ensuring the model doesn’t miss critical information and thereby improving recognition accuracy.


2.6 Overall workflow

The complete feature processing flow is as follows:

  1. Five morphological feature analyzers simultaneously process spatial features, each using different-sized convolution layers and focusing on different features
  2. The feature attention mechanism dynamically adjusts focus on different features
  3. The feature relationship analyzer captures correlations between features, truly understanding breed characteristics
  4. The feature integrator combines all information (five original features + one relationship feature)
  5. Residual connections ensure no original information is lost (see the sketch below)
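To tie these five steps together, here is a simplified sketch of how the whole extractor’s forward pass might look. It reuses the module definitions shown above, but the input dimension, the projection used as the feature space transformer, and the average pooling of each analyzer’s output are my own assumptions, since those details are not spelled out here:

import torch
import torch.nn as nn

class MorphologicalFeatureExtractorSketch(nn.Module):
    def __init__(self, in_features=768, spatial=8):
        super().__init__()
        self.spatial = spatial
        # Assumed "feature space transformer": project 1D backbone features into a 64-channel 2D map
        self.space_transformer = nn.Linear(in_features, 64 * spatial * spatial)

        # Five parallel analyzers; kernel sizes follow the article (7x7, 5x5, 5x5, 3x3, 3x3)
        def analyzer(k):
            return nn.Sequential(
                nn.Conv2d(64, 128, kernel_size=k, padding=k // 2), nn.BatchNorm2d(128), nn.ReLU(),
                nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            )
        self.analyzers = nn.ModuleDict({
            'body_proportion': analyzer(7), 'head_features': analyzer(5),
            'tail_features': analyzer(5), 'fur_features': analyzer(3), 'color_pattern': analyzer(3),
        })

        self.feature_attention = nn.MultiheadAttention(embed_dim=128, num_heads=8,
                                                       dropout=0.1, batch_first=True)
        self.relation_analyzer = nn.Sequential(
            nn.Linear(128 * 5, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, 128), nn.LayerNorm(128), nn.ReLU(),
        )
        self.feature_integrator = nn.Sequential(
            nn.Linear(128 * 6, in_features), nn.LayerNorm(in_features), nn.ReLU(),
        )

    def forward(self, x):
        b = x.size(0)
        # 1. Reshape 1D backbone features into a structured 2D map
        fmap = self.space_transformer(x).view(b, 64, self.spatial, self.spatial)
        # 2. Run the five analyzers in parallel; average-pool each to a 128-dim descriptor
        descriptors = [a(fmap).mean(dim=(2, 3)) for a in self.analyzers.values()]
        stacked = torch.stack(descriptors, dim=1)                     # (B, 5, 128)
        # 3. Attention dynamically re-weights the five descriptors
        attended, _ = self.feature_attention(stacked, stacked, stacked)
        # 4. Relation analyzer looks at all five descriptors jointly
        relation = self.relation_analyzer(attended.flatten(1))        # (B, 128)
        # 5. Integrate the five descriptors plus the relation feature, then add the residual
        combined = torch.cat([attended.flatten(1), relation], dim=1)  # (B, 128 * 6)
        return self.feature_integrator(combined) + x

# Quick shape check
x = torch.randn(4, 768)
print(MorphologicalFeatureExtractorSketch(in_features=768)(x).shape)  # torch.Size([4, 768])

In the full model, the enhanced representation produced this way is then combined with the ConvNeXtV2 backbone features and passed to the classifier head, as described in the architecture diagram below.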

3. Architecture flow diagram: How the morphological feature extractor works

Looking at the diagram, we can see a clear distinction between two processing paths: on the left, a specialized morphological feature extraction process, and on the right, the traditional CNN-based recognition path.

Left path: Morphological feature processing

  1. Input feature tensor: This is the model’s input, featuring information from the CNN’s middle layers, similar to how humans first get a rough outline when viewing an image.
  2. The Feature Space Transformer reshapes compressed 1D features into a structured 2D representation, improving the model’s ability to capture spatial relationships. For example, when analyzing a dog’s ears, their features might be scattered in a 1D vector, making it harder for the model to recognize their connection. By mapping them into 2D space, this transformation brings related traits closer together, allowing the model to process them simultaneously, just as humans naturally do.
  3. 2D feature map: This is the transformed two-dimensional representation which, as mentioned above, now has more spatial structure and can be used for morphological analysis.
  4. At the heart of this system are five specialized Morphological Feature Analyzers, each designed to focus on a key aspect of dog breed identification:
    • Body Proportion Analyzer: Uses large convolution kernels (7×7) to capture overall shape and proportion relationships, which is the first step in preliminary classification
    • Head Feature Analyzer: Uses medium-sized convolution kernels (5×5) combined with smaller ones (3×3), focusing on head shape, ear position, muzzle length, and other key features
    • Tail Feature Analyzer: Similarly uses a combination of 5×5 and 3×3 convolution kernels to analyze tail shape, curl degree, and posture, which are often decisive features for distinguishing similar breeds
    • Fur Feature Analyzer: Uses consecutive small convolution kernels (3×3), specifically designed to capture subtle features such as fur texture, length, and density
    • Color Pattern Analyzer: Employs a multi-layered convolution architecture, including 1×1 convolutions for color integration, specifically analyzing color distribution patterns and specific markings
  5. Similar to how our eyes instinctively focus on the most distinguishing features when recognizing faces, the Feature Attention Mechanism dynamically adjusts its focus on key morphological traits, ensuring the model prioritizes the most relevant details for each breed.

Right path: Standard CNN processing

  1. Original feature representation: The initial feature representation of the image.
  2. CNN backbone (ConvNeXtV2): Uses ConvNeXtV2 as the backbone network, extracting features through standard deep learning methods.
  3. Classifier head: Transforms features into classification probabilities for 124 dog breeds.

Integration path

  1. The Feature Relation Analyzer goes beyond isolated traits, it examines how different features interact, capturing relationships that define a breed’s unique appearance. For example, combinations like “head shape + tail posture + fur texture” might point to specific breeds.
  2. Feature integrator: Integrates morphological features and their relationship information to form a more comprehensive representation.
  3. Enhanced feature representation: The final feature representation, combining original features (through residual connections) and features obtained from morphological analysis.
  4. Finally, the model delivers its prediction, determining the breed based on a combination of original CNN features and morphological analysis.

4. Performance observations of the morphological feature extractor

After analyzing the entire model architecture, the most important question was: Does it actually work? To verify the effectiveness of the Morphological Feature Extractor, I tested 30 photos of dog breeds that models typically confuse. A comparison between models shows a significant improvement: the baseline model correctly classified 23 out of 30 images (76.7%), while the addition of the Morphological Feature Extractor increased accuracy to 90% (27 out of 30 images). 

This improvement is not just reflected in numbers but also in how the model differentiates breeds. The heat maps below show which image regions the model focuses on before and after integrating the feature extractor.

4.1 Recognizing a Dachshund’s unique body proportions

Let’s start with a misclassification case. The heatmap below shows that without the Morphological Feature Extractor, the model incorrectly classified a Dachshund as a Golden Retriever.

  • Without morphological features, the model relied too much on color and fur texture, rather than recognizing the dog’s overall structure. The heat map reveals that the model’s attention was scattered, not just on the dog’s face, but also on background elements like the roof, which likely influenced the misclassification.
  • Since long-haired Dachshunds and Golden Retrievers share a similar coat color, the model was misled, focusing more on superficial similarities rather than distinguishing key features like body proportions and ear shape.

This shows a common issue with deep learning models: without proper guidance, they can focus on the wrong things and make mistakes. Here, the background distractions kept the model from noticing the Dachshund’s long body and short legs, which set it apart from a Golden Retriever.

However, after integrating the Morphological Feature Extractor, the model’s attention shifted significantly, as seen in the heatmap below:

Key observations from the Dachshund’s attention heatmap:

  • Background distractions were significantly reduced. The model learned to ignore environmental elements like grass and trees, focusing more on the dog’s structural features.
  • The model’s focus has shifted to the Dachshund’s facial features, particularly the eyes, nose, and mouth, key traits for breed recognition. Compared to before, attention is no longer scattered, resulting in a more stable and confident classification.

This confirms that the Morphological Feature Extractor helps the model filter out irrelevant background noise and focus on the defining facial traits of each breed, making its predictions more reliable.


4.2 Distinguishing Siberian Huskies from other northern breeds

For sled dogs, the impact of the Morphological Feature Extractor was even more pronounced. Below is a heatmap before the extractor was applied, where the model misclassified a Siberian Husky as an Eskimo Dog.

As seen in the heatmap, the model failed to focus on any distinguishing features, instead displaying a diffused, unfocused attention distribution. This suggests the model was uncertain about the defining traits of a Husky, leading to misclassification.

However, after incorporating the Morphological Feature Extractor, a critical transformation occurred:

Distinguishing Siberian Huskies from other northern breeds (like Alaskan Malamutes) is another case that impressed me. As you can see in the heatmap, the model’s attention is highly concentrated on the Husky’s facial features.

What’s interesting is the yellow highlighted area around the eyes. The Husky’s iconic blue eyes and distinctive “mask” pattern are key features that distinguish it from other sled dogs. The model also notices the Husky’s distinctive ear shape, which is smaller and closer to the head than an Alaskan Malamute’s, forming a distinct triangular shape.

Most surprising to me was that despite the snow and red berries in the background (elements that might interfere with the baseline model), the improved model pays minimal attention to these distractions, focusing on the breed itself.


4.3 Summary of heatmap analysis

Through these heatmaps, we can clearly see how the Morphological Feature Extractor has changed the model’s “thinking process,” making it more similar to expert recognition abilities:

  1. Morphology takes priority over color: The model is no longer swayed by surface features (like fur color) but has learned to prioritize body type, head shape, and other features that experts use to distinguish similar breeds.
  2. Dynamic allocation of attention: The model demonstrates flexibility in feature prioritization: emphasizing body proportions for Dachshunds and facial markings for Huskies, similar to expert recognition processes.
  3. Enhanced interference resistance: The model has learned to ignore backgrounds and non-characteristic parts, maintaining focus on key morphological features even in noisy environments.

5. Potential applications and future improvements

Through this project, I believe the concept of Morphological Feature Extractors won’t be limited to dog breed identification. This concept could be applicable to other domains that rely on recognizing fine-grained differences. However, defining what constitutes a ‘morphological feature’ varies by field, making direct transferability a challenge.

5.1 Applications in fine-grained visual classification

Inspired by biological classification principles, this approach is particularly useful for distinguishing objects with subtle differences. Some practical applications include:

  • Medical diagnosis: Tumor classification, dermatological analysis, and radiology (X-ray/CT scans), where doctors rely on shape, texture, and boundary features to differentiate conditions.
  • Plant and insect identification: Certain poisonous mushrooms closely resemble edible ones, requiring expert knowledge to differentiate based on morphology.
  • Industrial quality control: Detecting microscopic defects in manufactured products, such as shape errors in electronic components or surface scratches on metals.
  • Art and artifact authentication: Museums and auction houses often rely on texture patterns, carving details, and material analysis to distinguish genuine artifacts from forgeries, an area where AI can assist.

This methodology could also be applied to surveillance and forensic analysis, such as recognizing individuals through gait analysis, clothing details, or vehicle identification in criminal investigations.


5.2 Challenges and future improvements

While the Morphological Feature Extractor has demonstrated its effectiveness, there are several challenges and areas for improvement:

  • Feature selection flexibility: The current system relies on predefined feature sets. Future enhancements could incorporate adaptive feature selection, dynamically adjusting key features based on object type (e.g., ear shape for dogs, wing structure for birds).
  • Computational efficiency: Although initially expected to scale well, real-world deployment revealed increased computational complexity, posing limitations for mobile or embedded devices.
  • Integration with advanced architectures: Combining morphological analysis with models like Transformers or Self-Supervised Learning could enhance performance but introduces challenges in feature representation consistency.
  • Cross-domain adaptability: While effective for dog breed classification, applying this approach to new fields (e.g., medical imaging or plant identification) requires redefinition of morphological features.
  • Explainability and few-shot learning potential: The intuitive nature of morphological features may facilitate low-data learning scenarios. However, overcoming deep learning’s dependency on large labeled datasets remains a key challenge.

These challenges indicate areas where the approach can be refined, rather than fundamental flaws in its design.


Conclusion

This development process made me realize that the Morphological Feature Extractor isn’t just another machine learning technique; it’s a step toward making AI think more like humans. Instead of passively memorizing patterns, this approach helps AI focus on key features, much like experts do.

Beyond Computer Vision, this idea could influence AI’s ability to reason, make decisions, and interpret information more effectively. As AI evolves, we are not just improving models but shaping systems that learn in a more human-like way.

Thank you for reading. Through developing PawMatchAI, I’ve gained valuable experience regarding AI visual systems and feature recognition, giving me new perspectives on AI development. If you have any viewpoints or topics you’d like to discuss, I welcome the exchange. 🙌

References & data sources

Dataset Sources

  • Stanford Dogs Dataset – Kaggle Dataset
    • Originally sourced from Stanford Vision Lab – ImageNet Dogs
    • Citation:
      • Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for Fine-Grained Image Categorization. FGVC Workshop, CVPR, 2011.
  • Unsplash Images – Additional images of four breeds (Bichon Frise, Dachshund, Shiba Inu, Havanese) were sourced from Unsplash for dataset augmentation. 

Research references

Image attribution

  • All images, unless otherwise noted, are created by the author.

Disclaimer

The methods and approaches described in this article are based on my personal research and experimental findings. While the Morphological Feature Extractor has demonstrated improvements in specific scenarios, its performance may vary depending on datasets, implementation details, and training conditions.

This article is intended for educational and informational purposes only. Readers should conduct independent evaluations and adapt the approach based on their specific use cases. No guarantees are made regarding its effectiveness across all applications.

The post From Fuzzy to Precise: How a Morphological Feature Extractor Enhances AI’s Recognition Capabilities appeared first on Towards Data Science.

Custom Training Pipeline for Object Detection Models https://towardsdatascience.com/custom-training-pipeline-for-object-detection-models/ Fri, 07 Mar 2025 20:59:29 +0000 https://towardsdatascience.com/?p=599069 I examined several well-known object detection pipelines and designed one that best suits my needs and tasks

What if you want to write the whole object detection training pipeline from scratch, so you can understand each step and be able to customize it? That’s what I set out to do. I examined several well-known object detection pipelines and designed one that best suits my needs and tasks. Studying the Ultralytics, YOLOx, DAMO-YOLO, RT-DETR, and D-FINE repos gave me a deeper understanding of various design details. I ended up implementing the SoTA real-time object detection model D-FINE in my custom pipeline.

Plan

  • Dataset, Augmentations and transforms:
    • Mosaic (with affine transforms)
    • Mixup and Cutout
    • Other augmentations with bounding boxes
    • Letterbox vs simple resize
  • Training:
    • Optimizer
    • Scheduler
    • EMA
    • Batch accumulation
    • AMP
    • Grad clipping
    • Logging
  • Metrics:
    • mAPs from TorchMetrics / cocotools
    • How to compute Precision, Recall, IoU?
  • Pick a suitable solution for your case
  • Experiments
  • Attention to data preprocessing
  • Where to start

Dataset

Dataset processing is the first thing you usually start working on. With object detection, you need to load your image and annotations. Annotations are often stored in COCO format as a json file or YOLO format, with txt file for each image. Let’s take a look at the YOLO format: Each line is structured as: class_id, x_center, y_center, width, height, where bbox values are normalized between 0 and 1.
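
As a quick illustration, a tiny helper that converts one YOLO-format line into pixel xyxy coordinates could look like this (my own sketch, not tied to any particular repo):

def yolo_line_to_xyxy(line: str, img_w: int, img_h: int):
    """Convert 'class_id x_center y_center width height' (normalized) into pixel xyxy."""
    class_id, xc, yc, w, h = line.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(class_id), [xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2]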

When you have your images and txt files, you can write your dataset class, nothing tricky here. Load everything, transform (augmentations included) and return during training. I prefer splitting the data by creating a CSV file for each split and then reading it in the Dataloader class rather than physically moving files into train/val/test folders. This is an example of a customization that helped my use case.
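
A rough sketch of that CSV-split idea, reusing the helper above (the CSV columns and file layout here are assumptions for illustration):

import cv2
import pandas as pd
from torch.utils.data import Dataset

class DetectionDataset(Dataset):
    """Reads one split from a CSV; annotations are YOLO-format txt files."""

    def __init__(self, csv_path: str, mode: str = "train"):
        self.items = pd.read_csv(csv_path)  # assumed columns: image_path, label_path
        self.mode = mode

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        row = self.items.iloc[idx]
        image = cv2.cvtColor(cv2.imread(row["image_path"]), cv2.COLOR_BGR2RGB)
        h, w = image.shape[:2]
        boxes, labels = [], []
        with open(row["label_path"]) as f:
            for line in f:
                cls, xyxy = yolo_line_to_xyxy(line, w, h)
                labels.append(cls)
                boxes.append(xyxy)
        # transforms/augmentations would be applied here before returning
        return image, boxes, labels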

Augmentations

Firstly, when augmenting images for object detection, it’s crucial to apply the same transformations to the bounding boxes. To comfortably do that I use Albumentations lib. For example:

    # Assumed imports for this snippet (not shown in the excerpt):
    # import cv2
    # import albumentations as A
    # from albumentations.pytorch import ToTensorV2
    def _init_augs(self, cfg) -> None:
        if self.keep_ratio:
            resize = [
                A.LongestMaxSize(max_size=max(self.target_h, self.target_w)),
                A.PadIfNeeded(
                    min_height=self.target_h,
                    min_width=self.target_w,
                    border_mode=cv2.BORDER_CONSTANT,
                    fill=(114, 114, 114),
                ),
            ]

        else:
            resize = [A.Resize(self.target_h, self.target_w)]
        norm = [
            A.Normalize(mean=self.norm[0], std=self.norm[1]),
            ToTensorV2(),
        ]

        if self.mode == "train":
            augs = [
                A.RandomBrightnessContrast(p=cfg.train.augs.brightness),
                A.RandomGamma(p=cfg.train.augs.gamma),
                A.Blur(p=cfg.train.augs.blur),
                A.GaussNoise(p=cfg.train.augs.noise, std_range=(0.1, 0.2)),
                A.ToGray(p=cfg.train.augs.to_gray),
                A.Affine(
                    rotate=[90, 90],
                    p=cfg.train.augs.rotate_90,
                    fit_output=True,
                ),
                A.HorizontalFlip(p=cfg.train.augs.left_right_flip),
                A.VerticalFlip(p=cfg.train.augs.up_down_flip),
            ]

            self.transform = A.Compose(
                augs + resize + norm,
                bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
            )

        elif self.mode in ["val", "test", "bench"]:
            self.mosaic_prob = 0
            self.transform = A.Compose(
                resize + norm,
                bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
            )

Secondly, there are a lot of interesting and not trivial augmentations:

  • Mosaic. The idea is simple, let’s take several images (for example 4), and stack them together in a grid (2×2). Then let’s do some affine transforms and feed it to the model.
  • MixUp. Originally used in image classification (it’s surprising that it works). Idea – let’s take two images, put them onto each other with some percent of transparency. In classification models it usually means that if one image is 20% transparent and the second is 80%, then the model should predict 80% for class 1 and 20% for class 2. In object detection we just get more objects into 1 image.
  • Cutout. Cutout involves removing parts of the image (by replacing them with black pixels) to help the model learn more robust features.

I often see mosaic applied with probability 1.0 for the first ~90% of epochs. Then it’s usually turned off, and lighter augmentations are used. The same idea applies to MixUp, but I see it used a lot less (in the most popular detection framework, Ultralytics, it’s turned off by default; in another one, I see p=0.15). Cutout seems to be used less frequently.
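
To make the MixUp idea above concrete for detection, here is a minimal sketch (an illustration, not code from any of the mentioned repos): blend the pixels of two same-sized samples and simply keep the boxes and labels from both.

import numpy as np

def mixup_detection(img1, boxes1, labels1, img2, boxes2, labels2, alpha: float = 32.0):
    """Blend two samples of the same shape; for detection, boxes from both images are kept."""
    lam = np.random.beta(alpha, alpha)  # with a high alpha, the ratio stays close to 0.5
    mixed = lam * img1.astype(np.float32) + (1 - lam) * img2.astype(np.float32)
    return mixed.astype(np.uint8), boxes1 + boxes2, labels1 + labels2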

You can read more about these augmentations in these two articles: [1], [2].

Results from just turning on mosaic are pretty good: the darker run without mosaic reached mAP 0.89 vs 0.92 with it (tested on a real dataset).

Author’s metrics on a custom dataset, logged in Wandb

Letterbox or simple resize?

During training, you usually resize the input image to a square. Models often use 640×640 and benchmark on COCO dataset. And there are two main ways how you get there:

  • Simple resize to a target size.
  • Letterbox: Resize the longest side to the target size (e.g., 640), preserving the aspect ratio, and pad the shorter side to reach the target dimensions.
Sample from VisDrone dataset with ground truth bounding boxes, preprocessed with a simple resize function
Sample from VisDrone dataset with ground truth bounding boxes, preprocessed with a letterbox

Both approaches have advantages and disadvantages. Let’s discuss them first, and then I will share the results of numerous experiments I ran comparing these approaches.

Simple resize:

  • Compute goes to the whole image, with no useless padding.
  • “Dynamic” aspect ratio may act as a form of regularization.
  • Inference preprocessing perfectly matches training preprocessing (augmentations excluded).
  • Kills real geometry. Resize distortion could affect the spatial relationships in the image. Although it might be a human bias to think that a fixed aspect ratio is important.

Letterbox:

  • Preserves real aspect ratio.
  • During inference, you can cut padding and run not on the square image if you don’t lose accuracy (some models can degrade).
  • Can train on a bigger image size, then inference with cut padding to get the same inference latency as with simple resize. For example 640×640 vs 832×480. The second one will preserve the aspect ratios and objects will appear +- the same size.
  • Part of the compute is wasted on gray padding.
  • Objects get smaller.

How to test it and decide which one to use? 

Train from scratch with parameters:

  • Simple resize, 640×640
  • Keep aspect ratio, max side 640, and add padding (as a baseline)
  • Keep aspect ratio, larger image size (for example max side 832), and add padding

Then run inference with all 3 models. When the aspect ratio is preserved, cut the padding during inference. Compare latency and metrics.

Example of the same image from above with cut padding (640 × 384): 

Sample from VisDrone dataset
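
A size like 640 × 384 falls out of a calculation like the one below: keep the aspect ratio, set the longest side to the target, and round the shorter side up to a multiple of the model’s stride. This is a small illustrative helper, not code from any specific framework:

import math

def ratio_kept_size(orig_w: int, orig_h: int, max_side: int = 640, stride: int = 32):
    """Inference size that preserves aspect ratio, short side rounded up to a stride multiple."""
    scale = max_side / max(orig_w, orig_h)
    short = math.ceil(min(orig_w, orig_h) * scale / stride) * stride
    return (max_side, short) if orig_w >= orig_h else (short, max_side)

# e.g. a wide 1360x765 frame maps to (640, 384), consistent with the 640 x 384 example above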

Here is what happens when you preserve ratio and inference by cutting gray padding:

params                  |  F1 score  |  latency (ms)  |
------------------------+------------+----------------|
ratio kept, 832         |   0.633    |      33.5      |
no ratio, 640x640       |   0.617    |      33.4      |

As shown, training with preserved aspect ratio at a larger size (832) achieved a higher F1 score (0.633) compared to a simple 640×640 resize (F1 score of 0.617), while the latency remained similar. Note that some models may degrade if the padding is removed during inference, which kills the whole purpose of this trick and probably the letterbox too.

What does this mean: 

Training from scratch:

  • With the same image size, simple resize gets better accuracy than letterbox.
  • For letterbox: if you cut the padding during inference and your model doesn’t lose accuracy, you can train and run inference at a bigger image size to match the latency and get slightly higher metrics (as in the example above). 

Training with pre-trained weights initialized:

  • If you finetune – use the same tactic as the pre-trained model did, it should give you the best results if the datasets are not too different.

For D-FINE, I see lower metrics when cutting padding during inference; the model was also pre-trained with a simple resize. For YOLO, a letterbox is typically a good choice.

Training

Every ML engineer should know how to implement a training loop. Although PyTorch does much of the heavy lifting, you might still feel overwhelmed by the number of design choices available. Here are some key components to consider:

  • Optimizer – start with Adam/AdamW/SGD.
  • Scheduler – fixed LR can be ok for Adams, but take a look at StepLR, CosineAnnealingLR or OneCycleLR.
  • EMA. This is a nice technique that makes training smoother and sometimes achieves higher metrics. After each batch, you update a secondary model (often called the EMA model)  by computing an exponential moving average of the primary model’s weights.
  • Batch accumulation is nice when your vRAM is very limited. Training a transformer-based object detection model means that in some cases even in a middle-sized model you only can fit 4 images into the vRAM. By accumulating gradients over several batches before performing an optimizer step, you effectively simulate a larger batch size without exceeding your memory constraints. Another use case is when you have a lot of negatives (images without target objects) in your dataset and a small batch size, you can encounter unstable training. Batch accumulation can also help here.
  • AMP uses half precision automatically where applicable. It reduces vRAM usage and makes training faster (if you have a GPU that supports it). I see 40% less vRAM usage and at least a 15% training speed increase.
  • Grad clipping. Often, when you use AMP, training can become less stable. This can also happen with higher LRs. When your gradients are too big, training will fail. Gradient clipping will make sure gradients are never bigger than a certain value.
  • Logging. Try Hydra for configs and something like Weights and Biases or Clear ML for experiment tracking. Also, log everything locally. Save your best weights, and metrics, so after numerous experiments, you can always find all the info on the model you need.
    def train(self) -> None:
        best_metric = 0
        cur_iter = 0
        ema_iter = 0
        one_epoch_time = None

        def optimizer_step(step_scheduler: bool):
            """
            Clip grads, optimizer step, scheduler step, zero grad, EMA model update
            """
            nonlocal ema_iter
            if self.amp_enabled:
                if self.clip_max_norm:
                    self.scaler.unscale_(self.optimizer)
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
                self.scaler.step(self.optimizer)
                self.scaler.update()

            else:
                if self.clip_max_norm:
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
                self.optimizer.step()

            if step_scheduler:
                self.scheduler.step()
            self.optimizer.zero_grad()

            if self.ema_model:
                ema_iter += 1
                self.ema_model.update(ema_iter, self.model)

        for epoch in range(1, self.epochs + 1):
            epoch_start_time = time.time()
            self.model.train()
            self.loss_fn.train()
            losses = []

            with tqdm(self.train_loader, unit="batch") as tepoch:
                for batch_idx, (inputs, targets, _) in enumerate(tepoch):
                    tepoch.set_description(f"Epoch {epoch}/{self.epochs}")
                    if inputs is None:
                        continue
                    cur_iter += 1

                    inputs = inputs.to(self.device)
                    targets = [
                        {
                            k: (v.to(self.device) if (v is not None and hasattr(v, "to")) else v)
                            for k, v in t.items()
                        }
                        for t in targets
                    ]

                    lr = self.optimizer.param_groups[0]["lr"]

                    if self.amp_enabled:
                        with autocast(self.device, cache_enabled=True):
                            output = self.model(inputs, targets=targets)
                        with autocast(self.device, enabled=False):
                            loss_dict = self.loss_fn(output, targets)
                        loss = sum(loss_dict.values()) / self.b_accum_steps
                        self.scaler.scale(loss).backward()

                    else:
                        output = self.model(inputs, targets=targets)
                        loss_dict = self.loss_fn(output, targets)
                        loss = sum(loss_dict.values()) / self.b_accum_steps
                        loss.backward()

                    if (batch_idx + 1) % self.b_accum_steps == 0:
                        optimizer_step(step_scheduler=True)

                    losses.append(loss.item())

                    tepoch.set_postfix(
                        loss=np.mean(losses) * self.b_accum_steps,
                        eta=calculate_remaining_time(
                            one_epoch_time,
                            epoch_start_time,
                            epoch,
                            self.epochs,
                            cur_iter,
                            len(self.train_loader),
                        ),
                        vram=f"{get_vram_usage()}%",
                    )

            # Final update for any leftover gradients from an incomplete accumulation step
            if (batch_idx + 1) % self.b_accum_steps != 0:
                optimizer_step(step_scheduler=False)

            wandb.log({"lr": lr, "epoch": epoch})

            metrics = self.evaluate(
                val_loader=self.val_loader,
                conf_thresh=self.conf_thresh,
                iou_thresh=self.iou_thresh,
                path_to_save=None,
            )

            best_metric = self.save_model(metrics, best_metric)
            save_metrics(
                {}, metrics, np.mean(losses) * self.b_accum_steps, epoch, path_to_save=None
            )

            if (
                epoch >= self.epochs - self.no_mosaic_epochs
                and self.train_loader.dataset.mosaic_prob
            ):
                self.train_loader.dataset.close_mosaic()

            if epoch == self.ignore_background_epochs:
                self.train_loader.dataset.ignore_background = False
                logger.info("Including background images")

            one_epoch_time = time.time() - epoch_start_time
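
The loop above calls self.ema_model.update(ema_iter, self.model), but the EMA helper itself is not shown. Here is a minimal sketch of what such a helper can look like (an illustrative version with a warmed-up decay; the actual implementation may differ):

import copy
import math

import torch

class ModelEMA:
    """Keeps an exponential moving average of model weights; early updates get a smaller decay."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999, warmup_steps: int = 2000):
        self.ema = copy.deepcopy(model).eval()
        for p in self.ema.parameters():
            p.requires_grad_(False)
        self.decay = decay
        self.warmup_steps = warmup_steps

    @torch.no_grad()
    def update(self, step: int, model: torch.nn.Module) -> None:
        d = self.decay * (1 - math.exp(-step / self.warmup_steps))  # decay ramps up with steps
        ema_state = self.ema.state_dict()
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                ema_state[k].mul_(d).add_(v.detach(), alpha=1 - d)
            else:
                ema_state[k].copy_(v)  # integer buffers (e.g. batch counters) are copied as-is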

Metrics

For object detection, everyone uses mAP, and how we measure it is already standardized. Use pycocotools, faster-coco-eval, or TorchMetrics for mAP. But mAP means that we check how good the model is overall, across all confidence levels. mAP0.5 means that the IoU threshold is 0.5 (anything lower is considered a wrong prediction). I personally don’t fully like this metric, as in production we always use a single confidence threshold. So why not set the threshold and then compute metrics? That’s why I also always calculate confusion matrices, and based on them – Precision, Recall, F1-score, and IoU.
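
For the mAP part, the TorchMetrics route looks roughly like this (a minimal usage sketch; check the current torchmetrics documentation for the exact API):

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")
preds = [{
    "boxes": torch.tensor([[50.0, 30.0, 200.0, 180.0]]),
    "scores": torch.tensor([0.84]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[55.0, 35.0, 210.0, 190.0]]),
    "labels": torch.tensor([0]),
}]
metric.update(preds, targets)
print(metric.compute()["map_50"])  # mAP at IoU threshold 0.5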

But the matching logic can be tricky. Here is what I use (a sketch of this matching follows the list):

  • 1 GT (ground truth) object = 1 predicted object, and it’s a TP if IoU > threshold. If there is no prediction for a GT object – it’s a FN. If there is no GT for a prediction – it’s a FP.
  • 1 GT should be matched by a prediction only 1 time. If there are 2 predictions for 1 GT, then I calculate 1 TP and 1 FP.
  • Class ids should also match. If the model predicts class_0 but GT is class_1, it means FP += 1 and FN += 1.
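
A simplified sketch of this per-image matching logic (predictions are assumed to be already filtered by the confidence threshold; names are illustrative):

def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detections(gt_boxes, gt_labels, pred_boxes, pred_labels, iou_thr=0.5):
    """Greedy per-image matching following the three rules above."""
    tp = fp = 0
    matched_gt = set()
    for pb, pl in zip(pred_boxes, pred_labels):
        best_iou, best_idx = 0.0, None
        for i, (gb, gl) in enumerate(zip(gt_boxes, gt_labels)):
            if i in matched_gt or gl != pl:
                continue
            iou = box_iou(pb, gb)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= iou_thr:
            tp += 1                      # correct class, enough overlap, first match for this GT
            matched_gt.add(best_idx)
        else:
            fp += 1                      # duplicate match, wrong class, or no overlapping GT
    fn = len(gt_boxes) - len(matched_gt)  # GT objects that never received a valid prediction
    return tp, fp, fn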

During training, I select the best model based on the metrics that are relevant to the task. I typically consider the average of mAP50 and F1-score.

Model and loss

I haven’t discussed model architecture and loss function here. They usually go together, and you can choose any model you like and integrate it into your pipeline with everything from above. I did that with DAMO-YOLO and D-FINE, and the results were great.

Pick a suitable solution for your case

Many people use Ultralytics; however, it is AGPL-3.0 licensed, so you can’t use it in commercial projects unless your code is open source. So people often look into Apache 2.0 and MIT licensed models. Check out D-FINE, RT-DETRv2, or some YOLO models like YOLOv9.

What if you want to customize something in the pipeline? When you build everything from scratch, you should have full control. Otherwise, try choosing a project with a smaller codebase, as a large one can make it difficult to isolate and modify individual components.

If you don’t need anything custom and your usage is allowed by the Ultralytics license, it’s a great repo to use: it supports multiple tasks (classification, detection, instance segmentation, key points, oriented bounding boxes), and the models are efficient and achieve good scores. To reiterate once more: you probably don’t need a custom training pipeline if you are not doing very specific things.

Experiments

Let me share some results I got with a custom training pipeline with the D-FINE model and compare it to the Ultralytics YOLO11 model on the VisDrone-DET2019 dataset.

Trained from scratch:

model                     |  mAP 0.50  |  F1-score  |  Latency (ms)  |
--------------------------+------------+------------+----------------|
YOLO11m TRT               |    0.417   |    0.568   |      15.6      |
YOLO11m TRT dynamic       |      -     |    0.568   |      13.3      |
YOLO11m OV                |      -     |    0.568   |     122.4      |
D-FINEm TRT               |    0.457   |    0.622   |      16.6      |
D-FINEm OV                |    0.457   |    0.622   |     115.3      |

From COCO pre-trained:

model          |    mAP 0.50   |   F1-score  |
---------------+---------------|-------------|
YOLO11m        |     0.456     |    0.600    |
D-FINEm        |     0.506     |    0.649    |

Latency was measured on an RTX 3060 with TensorRT (TRT), static image size 640×640, including the time for cv2.imread. OpenVINO (OV) on i5 14000f (no iGPU). Dynamic means that during inference, gray padding is being cut for faster inference. It worked with the YOLO11 TensorRT version. More details about cutting gray padding above (Letterbox or simple resize section).

One disappointing result is the latency on intel N100 CPU with iGPU ($150 miniPC):

model            | Latency (ms) |
-----------------+--------------|
YOLO11m          |       188    |
D-FINEm          |       272    |
D-FINEs          |       11     |
Author’s screenshot of iGPU usage from n100 machine during model inference

Here, traditional convolutional neural networks are noticeably faster, maybe because of optimizations in OpenVINO for GPUs.

Overall, I conducted over 30 experiments with different datasets (including real-world datasets), models, and parameters and I can say that D-FINE gets better metrics. And it makes sense, as on COCO, it is also higher than all YOLO models. 

D-FINE paper comparison to other object detection models

VisDrone experiments: 

Author’s metrics logged in WandB, D-FINE model
Author’s metrics logged in WandB, YOLO11 model

Example of D-FINE model predictions (green – GT, blue – pred): 

Sample from VisDrone dataset

Final results

Knowing all the details, let’s see a final comparison with the best settings for both models on i12400F and RTX 3060 with the VisDrone dataset:

model                              |   F1-score    |   Latency (ms)    |
-----------------------------------+---------------+-------------------|
YOLO11m TRT dynamic                |      0.600    |        13.3       |
YOLO11m OV                         |      0.600    |       122.4       |
D-FINEs TRT                        |      0.629    |        12.3       |
D-FINEs OV                         |      0.629    |        57.4       |

As shown above, I was able to use a smaller D-FINE model and achieve both faster inference time and accuracy than YOLO11. Beating Ultralytics, the most widely used real-time object detection model, in both speed and accuracy, is quite an accomplishment, isn’t it? The same pattern is observed across several other real-world datasets.

I also tried out YOLOv12, which came out while I was writing this article. It performed similarly to YOLO11, even achieving slightly lower metrics (mAP 0.452 vs 0.456 for YOLO11). It appears that YOLO models have been hitting a wall for the last couple of years. D-FINE was a great update for object detection models.

Finally, let’s see visually the difference between YOLO11m and D-FINEs. YOLO11m, conf 0.25, nms iou 0.5, latency 13.3ms: 

Sample from VisDrone dataset

D-FINEs, conf 0.5, no nms, latency 12.3ms: 

Sample from VisDrone dataset

Both Precision and Recall are higher with the D-FINE model. And it’s also faster. Here is also “m” version of D-FINE: 

Sample from VisDrone dataset

Isn’t it crazy that even that one car on the left was detected?

Attention to data preprocessing

This part goes a little beyond the scope of the article, but I want to at least quickly mention it, as some of it can be automated and used in the pipeline. What I consistently see as a computer vision engineer is that when engineers don’t spend time working with the data, they don’t get good models. You can have SoTA models and everything done right, but garbage in – garbage out. So I always pay a ton of attention to how to approach the task and how to gather, filter, validate, and annotate the data. Don’t assume that the annotation team will do everything right. Get your hands dirty and manually check a portion of the dataset to be sure the annotations are good and the collected images are representative.

Several quick ideas to look into:

  • Remove duplicates and near duplicates from val/test sets. The model should not be validated on one sample twice, and you definitely don’t want a data leak from having the same image both in the training and in the validation set (see the sketch after this list).
  • Check how small your objects can be. Everything not visible to your eye should not be annotated. Also, remember that augmentations will make objects appear even smaller (for example, mosaic or zoom out). Configure these augmentations accordingly so you won’t end up with unusably small objects on the image.
  • When you already have a model for a certain task and need more data – try using your model to pre-annotate new images. Check cases where the model fails and gather more similar cases.
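
For the duplicate check from the first point, a perceptual-hash pass is usually enough. Here is a minimal sketch using the imagehash library; the file pattern and distance threshold are assumptions you would tune for your data:

from pathlib import Path

import imagehash
from PIL import Image

def find_near_duplicates(image_dir: str, max_distance: int = 5):
    """Flag image pairs whose perceptual hashes differ by at most max_distance bits."""
    hashes = {p: imagehash.phash(Image.open(p)) for p in Path(image_dir).glob("*.jpg")}
    paths = list(hashes)
    duplicates = []
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if hashes[a] - hashes[b] <= max_distance:  # '-' gives the Hamming distance
                duplicates.append((a, b))
    return duplicates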

Where to start

I worked a lot on this pipeline, and I am ready to share it with everyone who wants to try it out. It uses the SoTA D-FINE model under the hood and adds some features that were absent in the original repo (mosaic augmentations, batch accumulation, scheduler, more metrics, visualization of preprocessed images and eval predictions, exporting and inference code, better logging, unified and simplified configuration file).

Here is the link to my repo. Here is the original D-FINE repo, where I also contribute. If you need any help, please contact me on LinkedIn. Thank you for your time!

Citations and acknowledgments

VisDrone

@article{zhu2021detection,
  title={Detection and tracking meet drones challenge},
  author={Zhu, Pengfei and Wen, Longyin and Du, Dawei and Bian, Xiao and Fan, Heng and Hu, Qinghua and Ling, Haibin},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume={44},
  number={11},
  pages={7380--7399},
  year={2021},
  publisher={IEEE}
}

D-FINE

@misc{peng2024dfine,
      title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
      author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
      year={2024},
      eprint={2410.13842},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

The post Custom Training Pipeline for Object Detection Models appeared first on Towards Data Science.

On-Device Machine Learning in Spatial Computing https://towardsdatascience.com/on-device-machine-learning-in-spatial-computing/ Mon, 17 Feb 2025 13:00:00 +0000 https://towardsdatascience.com/?p=598014 The landscape of computing is undergoing a profound transformation with the emergence of spatial computing platforms(VR and AR). As we step into this new era, the intersection of virtual reality, augmented reality, and on-device machine learning presents unprecedented opportunities for developers to create experiences that seamlessly blend digital content with the physical world. The introduction […]

The landscape of computing is undergoing a profound transformation with the emergence of spatial computing platforms (VR and AR). As we step into this new era, the intersection of virtual reality, augmented reality, and on-device machine learning presents unprecedented opportunities for developers to create experiences that seamlessly blend digital content with the physical world.

The introduction of visionOS marks a significant milestone in this evolution. Apple’s Spatial Computing platform combines sophisticated hardware capabilities with powerful development frameworks, enabling developers to build applications that can understand and interact with the physical environment in real time. This convergence of spatial awareness and on-device machine learning capabilities opens up new possibilities for object recognition and tracking applications that were previously challenging to implement.


What We’re Building

In this guide, we’ll be building an app that showcases the power of on-device machine learning in visionOS. We’ll create an app that can recognize and track a diet soda can in real time, overlaying visual indicators and information directly in the user’s field of view.

Our app will leverage several key technologies in the visionOS ecosystem. When a user runs the app, they’re presented with a window containing a rotating 3D model of our target object along with usage instructions. As they look around their environment, the app continuously scans for diet soda cans. Upon detection, it displays dynamic bounding lines around the can and places a floating text label above it, all while maintaining precise tracking as the object or user moves through space.

Before we begin development, let’s ensure we have the necessary tools and understanding in place. This tutorial requires:

  • The latest version of Xcode 16 with visionOS SDK installed
  • visionOS 2.0 or later running on an Apple Vision Pro device
  • Basic familiarity with SwiftUI and the Swift programming language

The development process will take us through several key stages, from capturing a 3D model of our target object to implementing real-time tracking and visualization. Each stage builds upon the previous one, giving you a thorough understanding of developing features powered by on-device machine learning for visionOS.

Building the Foundation: 3D Object Capture

The first step in creating our object recognition system involves capturing a detailed 3D model of our target object. Apple provides a powerful app for this purpose: RealityComposer, available for iOS through the App Store.

When capturing a 3D model, environmental conditions play a crucial role in the quality of our results. Setting up the capture environment properly ensures we get the best possible data for our machine learning model. A well-lit space with consistent lighting helps the capture system accurately detect the object’s features and dimensions. The diet soda can should be placed on a surface with good contrast, making it easier for the system to distinguish the object’s boundaries.

The capture process begins by launching the RealityComposer app and selecting “Object Capture” from the available options. The app guides us through positioning a bounding box around our target object. This bounding box is critical as it defines the spatial boundaries of our capture volume.

RealityComposer — Object Capture Flow — Image By Author

Once we’ve captured all the details of the soda can with the help of the in-app guide and processed the images, a .usdz file containing our 3D model will be created. This file format is specifically designed for AR/VR applications and contains not just the visual representation of our object, but also important information that will be used in the training process.

Training the Reference Model

With our 3D model in hand, we move to the next crucial phase: training our recognition model using Create ML. Apple’s Create ML application provides a straightforward interface for training machine learning models, including specialized templates for spatial computing applications.

To begin the training process, we launch Create ML and select the “Object Tracking” template from the spatial category. This template is specifically designed for training models that can recognize and track objects in three-dimensional space.

CreateML Project Setup — Image By Author

After creating a new project, we import our .usdz file into Create ML. The system automatically analyzes the 3D model and extracts key features that will be used for recognition. The interface provides options for configuring how our object should be recognized in space, including viewing angles and tracking preferences.

Once you’ve imported the 3D model and analyzed it from various angles, go ahead and click on “Train”. Create ML will process our model and begin the training phase. During this phase, the system learns to recognize our object from various angles and under different conditions. The training process can take several hours as the system builds a comprehensive understanding of our object’s characteristics.

Create ML Training Process — Image By Author

The output of this training process is a .referenceobject file, which contains the trained model data optimized for real-time object detection in visionOS. This file encapsulates all the learned features and recognition parameters that will enable our app to identify diet soda cans in the user’s environment.

The successful creation of our reference object marks an important milestone in our development process. We now have a trained model capable of recognizing our target object in real-time, setting the stage for implementing the actual detection and visualization functionality in our visionOS application.

Initial Project Setup

Now that we have our trained reference object, let’s set up our visionOS project. Launch Xcode and select “Create a new Xcode project”. In the template selector, choose visionOS under the platforms filter and select “App”. This template provides the basic structure needed for a visionOS application.

Xcode visionOS Project Setup — Image By Author

In the project configuration dialog, configure your project with these primary settings:

  • Product Name: SodaTracker
  • Initial Scene: Window
  • Immersive Space Renderer: RealityKit
  • Immersive Space: Mixed

After project creation, we need to make a few essential modifications. First, delete the file named ToggleImmersiveSpaceButton.swift as we won’t be using it in our implementation.

Next, we’ll add our previously created assets to the project. In Xcode’s Project Navigator, locate the “RealityKitContent.rkassets” folder and add the 3D object file (“SodaModel.usdz” file). This 3D model will be used in our informative view. Create a new group named “ReferenceObjects” and add the “Diet Soda.referenceobject” file we generated using Create ML.

The final setup step is to configure the necessary permission for object tracking. Open your project’s Info.plist file and add a new key: NSWorldSensingUsageDescription. Set its value to “Used to track diet sodas”. This permission is required for the app to detect and track objects in the user’s environment.

With these setup steps complete, we have a properly configured visionOS project ready for implementing our object tracking functionality.

Entry Point Implementation

Let’s start with SodaTrackerApp.swift, which was automatically created when we set up our visionOS project. We need to modify this file to support our object tracking functionality. Replace the default implementation with the following code:

import SwiftUI

/**
 SodaTrackerApp is the main entry point for the application.
 It configures the app's window and immersive space, and manages
 the initialization of object detection capabilities.
 
 The app automatically launches into an immersive experience
 where users can see Diet Soda cans being detected and highlighted
 in their environment.
 */
@main
struct SodaTrackerApp: App {
    /// Shared model that manages object detection state
    @StateObject private var appModel = AppModel()
    
    /// System environment value for launching immersive experiences
    @Environment(\.openImmersiveSpace) var openImmersiveSpace
    
    var body: some Scene {
        WindowGroup {
            ContentView()
                .environmentObject(appModel)
                .task {
                    // Load and prepare object detection capabilities
                    await appModel.initializeDetector()
                }
                .onAppear {
                    Task {
                        // Launch directly into immersive experience
                        await openImmersiveSpace(id: appModel.immersiveSpaceID)
                    }
                }
        }
        .windowStyle(.plain)
        .windowResizability(.contentSize)
        
        // Configure the immersive space for object detection
        ImmersiveSpace(id: appModel.immersiveSpaceID) {
            ImmersiveView()
                .environment(appModel)
        }
        // Use mixed immersion to blend virtual content with reality
        .immersionStyle(selection: .constant(.mixed), in: .mixed)
        // Hide system UI for a more immersive experience
        .persistentSystemOverlays(.hidden)
    }
}

The key aspect of this implementation is the initialization and management of our object detection system. When the app launches, we initialize our AppModel which handles the ARKit session and object tracking setup. The initialization sequence is crucial:

.task {
    await appModel.initializeDetector()
}

This asynchronous initialization loads our trained reference object and prepares the ARKit session for object tracking. We ensure this happens before opening the immersive space where the actual detection will occur.

The immersive space configuration is particularly important for object tracking:

.immersionStyle(selection: .constant(.mixed), in: .mixed)

The mixed immersion style is essential for our object tracking implementation as it allows RealityKit to blend our visual indicators (bounding boxes and labels) with the real-world environment where we’re detecting objects. This creates a seamless experience where digital content accurately aligns with physical objects in the user’s space.

With these modifications to SodaTrackerApp.swift, our app is ready to begin the object detection process, with ARKit, RealityKit, and our trained model working together in the mixed reality environment. In the next section, we’ll examine the core object detection functionality in AppModel.swift, another file that was created during project setup.

Core Detection Model Implementation

AppModel.swift, created during project setup, serves as our core detection system. This file manages the ARKit session, loads our trained model, and coordinates the object tracking process. Let’s examine its implementation:

import SwiftUI
import RealityKit
import ARKit

/**
 AppModel serves as the core model for the soda can detection application.
 It manages the ARKit session, handles object tracking initialization,
 and maintains the state of object detection throughout the app's lifecycle.
 
 This model is designed to work with visionOS's object tracking capabilities,
 specifically optimized for detecting Diet Soda cans in the user's environment.
 */
@MainActor
@Observable
class AppModel: ObservableObject {
    /// Unique identifier for the immersive space where object detection occurs
    let immersiveSpaceID = "SodaTracking"
    
    /// ARKit session instance that manages the core tracking functionality
    /// This session coordinates with visionOS to process spatial data
    private var arSession = ARKitSession()
    
    /// Dedicated provider that handles the real-time tracking of soda cans
    /// This maintains the state of currently tracked objects
    private var sodaTracker: ObjectTrackingProvider?
    
    /// Collection of reference objects used for detection
    /// These objects contain the trained model data for recognizing soda cans
    private var targetObjects: [ReferenceObject] = []
    
    /**
     Initializes the object detection system by loading and preparing
     the reference object (Diet Soda can) from the app bundle.
     
     This method loads a pre-trained model that contains spatial and
     visual information about the Diet Soda can we want to detect.
     */
    func initializeDetector() async {
        guard let objectURL = Bundle.main.url(forResource: "Diet Soda", withExtension: "referenceobject") else {
            print("Error: Failed to locate reference object in bundle - ensure Diet Soda.referenceobject exists")
            return
        }
        
        do {
            let referenceObject = try await ReferenceObject(from: objectURL)
            self.targetObjects = [referenceObject]
        } catch {
            print("Error: Failed to initialize reference object: \(error)")
        }
    }
    
    /**
     Starts the active object detection process using ARKit.
     
     This method initializes the tracking provider with loaded reference objects
     and begins the real-time detection process in the user's environment.
     
     Returns: An ObjectTrackingProvider if successfully initialized, nil otherwise
     */
    func beginDetection() async -> ObjectTrackingProvider? {
        guard !targetObjects.isEmpty else { return nil }
        
        let tracker = ObjectTrackingProvider(referenceObjects: targetObjects)
        do {
            try await arSession.run([tracker])
            self.sodaTracker = tracker
            return tracker
        } catch {
            print("Error: Failed to initialize tracking: \(error)")
            return nil
        }
    }
    
    /**
     Terminates the object detection process.
     
     This method safely stops the ARKit session and cleans up
     tracking resources when object detection is no longer needed.
     */
    func endDetection() {
        arSession.stop()
    }
}

At the core of our implementation is ARKitSession, visionOS’s gateway to spatial computing capabilities. The @MainActor attribute ensures our object detection operations run on the main thread, which is crucial for synchronizing with the rendering pipeline.

private var arSession = ARKitSession()
private var sodaTracker: ObjectTrackingProvider?
private var targetObjects: [ReferenceObject] = []

The ObjectTrackingProvider is a specialized component in visionOS that handles real-time object detection. It works in conjunction with ReferenceObject instances, which contain the spatial and visual information from our trained model. We maintain these as private properties to ensure proper lifecycle management.

The initialization process is particularly important:

let referenceObject = try await ReferenceObject(from: objectURL)
self.targetObjects = [referenceObject]

Here, we load our trained model (the .referenceobject file we created in Create ML) into a ReferenceObject instance. This process is asynchronous because the system needs to parse and prepare the model data for real-time detection.

The beginDetection method sets up the actual tracking process:

let tracker = ObjectTrackingProvider(referenceObjects: targetObjects)
try await arSession.run([tracker])

When we create the ObjectTrackingProvider, we pass in our reference objects. The provider uses these to establish the detection parameters — what to look for, what features to match, and how to track the object in 3D space. The ARKitSession.run call activates the tracking system, beginning the real-time analysis of the user’s environment.

Immersive Experience Implementation

ImmersiveView.swift, provided in our initial project setup, manages the real-time object detection visualization in the user’s space. This view processes the continuous stream of detection data and creates visual representations of detected objects. Here’s the implementation:

import SwiftUI
import RealityKit
import ARKit

/**
 ImmersiveView is responsible for creating and managing the augmented reality
 experience where object detection occurs. This view handles the real-time
 visualization of detected soda cans in the user's environment.
 
 It maintains a collection of visual representations for each detected object
 and updates them in real-time as objects are detected, moved, or removed
 from view.
 */
struct ImmersiveView: View {
    /// Access to the app's shared model for object detection functionality
    @Environment(AppModel.self) private var appModel
    
    /// Root entity that serves as the parent for all AR content
    /// This entity provides a consistent coordinate space for all visualizations
    @State private var sceneRoot = Entity()
    
    /// Maps unique object identifiers to their visual representations
    /// Enables efficient updating of specific object visualizations
    @State private var activeVisualizations: [UUID: ObjectVisualization] = [:]
    
    var body: some View {
        RealityView { content in
            // Initialize the AR scene with our root entity
            content.add(sceneRoot)
            
            Task {
                // Begin object detection and track changes
                let detector = await appModel.beginDetection()
                guard let detector else { return }
                
                // Process real-time updates for object detection
                for await update in detector.anchorUpdates {
                    let anchor = update.anchor
                    let id = anchor.id
                    
                    switch update.event {
                    case .added:
                        // Object newly detected - create and add visualization
                        let visualization = ObjectVisualization(for: anchor)
                        activeVisualizations[id] = visualization
                        sceneRoot.addChild(visualization.entity)
                        
                    case .updated:
                        // Object moved - update its position and orientation
                        activeVisualizations[id]?.refreshTracking(with: anchor)
                        
                    case .removed:
                        // Object no longer visible - remove its visualization
                        activeVisualizations[id]?.entity.removeFromParent()
                        activeVisualizations.removeValue(forKey: id)
                    }
                }
            }
        }
        .onDisappear {
            // Clean up AR resources when view is dismissed
            cleanupVisualizations()
        }
    }
    
    /**
     Removes all active visualizations and stops object detection.
     This ensures proper cleanup of AR resources when the view is no longer active.
     */
    private func cleanupVisualizations() {
        for (_, visualization) in activeVisualizations {
            visualization.entity.removeFromParent()
        }
        activeVisualizations.removeAll()
        appModel.endDetection()
    }
}

The core of our object tracking visualization lies in the detector’s anchorUpdates stream. This ARKit feature provides a continuous flow of object detection events:

for await update in detector.anchorUpdates {
    let anchor = update.anchor
    let id = anchor.id
    
    switch update.event {
    case .added:
        // Object first detected
    case .updated:
        // Object position changed
    case .removed:
        // Object no longer visible
    }
}

Each ObjectAnchor contains crucial spatial data about the detected soda can, including its position, orientation, and bounding box in 3D space. When a new object is detected (.added event), we create a visualization that RealityKit will render in the correct position relative to the physical object. As the object or user moves, the .updated events ensure our virtual content stays perfectly aligned with the real world.

Visual Feedback System

Create a new file named ObjectVisualization.swift for handling the visual representation of detected objects. This component is responsible for creating and managing the bounding box and text overlay that appears around detected soda cans:

import RealityKit
import ARKit
import UIKit
import SwiftUI

/**
 ObjectVisualization manages the visual elements that appear when a soda can is detected.
 This class handles both the 3D text label that appears above the object and the
 bounding box that outlines the detected object in space.
 */
@MainActor
class ObjectVisualization {
    /// Root entity that contains all visual elements
    var entity: Entity
    
    /// Entity specifically for the bounding box visualization
    private var boundingBox: Entity
    
    /// Width of bounding box lines - 0.003 provides optimal visibility without being too intrusive
    private let outlineWidth: Float = 0.003
    
    init(for anchor: ObjectAnchor) {
        entity = Entity()
        boundingBox = Entity()
        
        // Set up the main entity's transform based on the detected object's position
        entity.transform = Transform(matrix: anchor.originFromAnchorTransform)
        entity.isEnabled = anchor.isTracked
        
        createFloatingLabel(for: anchor)
        setupBoundingBox(for: anchor)
        refreshBoundingBoxGeometry(with: anchor)
    }
    
    /**
     Creates a floating text label that hovers above the detected object.
     The text uses Avenir Next font for optimal readability in AR space and
     is positioned slightly above the object for clear visibility.
     */
    private func createFloatingLabel(for anchor: ObjectAnchor) {
        // 0.06 units provides optimal text size for viewing at typical distances
        let labelSize: Float = 0.06
        
        // Use Avenir Next for its clarity and modern appearance in AR
        let font = MeshResource.Font(name: "Avenir Next", size: CGFloat(labelSize))!
        let textMesh = MeshResource.generateText("Diet Soda",
                                               extrusionDepth: labelSize * 0.15,
                                               font: font)
        
        // Create a material that makes text clearly visible against any background
        var textMaterial = UnlitMaterial()
        textMaterial.color = .init(tint: .orange)
        
        let textEntity = ModelEntity(mesh: textMesh, materials: [textMaterial])
        
        // Position text above object with enough clearance to avoid intersection
        textEntity.transform.translation = SIMD3(
            anchor.boundingBox.center.x - textMesh.bounds.max.x / 2,
            anchor.boundingBox.extent.y + labelSize * 1.5,
            0
        )
        
        entity.addChild(textEntity)
    }
    
    /**
     Creates a bounding box visualization that outlines the detected object.
     Uses a semi-transparent magenta color to provide a clear
     but non-distracting visual boundary around the detected soda can.
     */
    private func setupBoundingBox(for anchor: ObjectAnchor) {
        let boxMesh = MeshResource.generateBox(size: [1.0, 1.0, 1.0])
        
        // Create a single material for all edges with magenta color
        let boundsMaterial = UnlitMaterial(color: .magenta.withAlphaComponent(0.4))
        
        // Create all edges with uniform appearance
        for _ in 0..<12 {
            let edge = ModelEntity(mesh: boxMesh, materials: [boundsMaterial])
            boundingBox.addChild(edge)
        }
        
        entity.addChild(boundingBox)
    }
    
    /**
     Updates the visualization when the tracked object moves.
     This ensures the bounding box and text maintain accurate positioning
     relative to the physical object being tracked.
     */
    func refreshTracking(with anchor: ObjectAnchor) {
        entity.isEnabled = anchor.isTracked
        guard anchor.isTracked else { return }
        
        entity.transform = Transform(matrix: anchor.originFromAnchorTransform)
        refreshBoundingBoxGeometry(with: anchor)
    }
    
    /**
     Updates the bounding box geometry to match the detected object's dimensions.
     Creates a precise outline that exactly matches the physical object's boundaries
     while maintaining a consistent visual appearance.
     */
    private func refreshBoundingBoxGeometry(with anchor: ObjectAnchor) {
        let extent = anchor.boundingBox.extent
        boundingBox.transform.translation = anchor.boundingBox.center
        
        for (index, edge) in boundingBox.children.enumerated() {
            guard let edge = edge as? ModelEntity else { continue }
            
            switch index {
            case 0...3:  // Horizontal edges along width
                edge.scale = SIMD3(extent.x, outlineWidth, outlineWidth)
                edge.position = [
                    0,
                    extent.y / 2 * (index % 2 == 0 ? -1 : 1),
                    extent.z / 2 * (index < 2 ? -1 : 1)
                ]
            case 4...7:  // Vertical edges along height
                edge.scale = SIMD3(outlineWidth, extent.y, outlineWidth)
                edge.position = [
                    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
                    0,
                    extent.z / 2 * (index < 6 ? -1 : 1)
                ]
            case 8...11: // Depth edges
                edge.scale = SIMD3(outlineWidth, outlineWidth, extent.z)
                edge.position = [
                    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
                    extent.y / 2 * (index < 10 ? -1 : 1),
                    0
                ]
            default:
                break
            }
        }
    }
}

The bounding box creation is a key aspect of our visualization. Rather than using a single box mesh, we construct 12 individual edges that form a wireframe outline. This approach provides better visual clarity and allows for more precise control over the appearance. The edges are positioned using SIMD3 vectors for efficient spatial calculations:

edge.position = [
    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
    extent.y / 2 * (index < 10 ? -1 : 1),
    0
]

This mathematical positioning ensures each edge aligns perfectly with the detected object’s dimensions. The calculation uses the object’s extent (width, height, depth) and creates a symmetrical arrangement around its center point.

This visualization system works in conjunction with our ImmersiveView to create real-time visual feedback. As the ImmersiveView receives position updates from ARKit, it calls refreshTracking on our visualization, which updates the transform matrices to maintain precise alignment between the virtual overlays and the physical object.

Informative View

ContentView With Instructions — Image By Author

ContentView.swift, provided in our project template, handles the informational interface for our app. Here’s the implementation:

import SwiftUI
import RealityKit
import RealityKitContent

/**
 ContentView provides the main window interface for the application.
 Displays a rotating 3D model of the target object (Diet Soda can)
 along with clear instructions for users on how to use the detection feature.
 */
struct ContentView: View {
    // State to control the continuous rotation animation
    @State private var rotation: Double = 0
    
    var body: some View {
        VStack(spacing: 30) {
            // 3D model display with rotation animation
            Model3D(named: "SodaModel", bundle: realityKitContentBundle)
                .padding(.vertical, 20)
                .frame(width: 200, height: 200)
                .rotation3DEffect(
                    .degrees(rotation),
                    axis: (x: 0, y: 1, z: 0)
                )
                .onAppear {
                    // Create continuous rotation animation
                    withAnimation(.linear(duration: 5.0).repeatForever(autoreverses: true)) {
                        rotation = 180
                    }
                }
            
            // Instructions for users
            VStack(spacing: 15) {
                Text("Diet Soda Detection")
                    .font(.title)
                    .fontWeight(.bold)
                
                Text("Hold your diet soda can in front of you to see it automatically detected and highlighted in your space.")
                    .font(.body)
                    .multilineTextAlignment(.center)
                    .foregroundColor(.secondary)
                    .padding(.horizontal)
            }
        }
        .padding()
        .frame(maxWidth: 400)
    }
}

This implementation displays our 3D-scanned soda model (SodaModel.usdz) with a rotating animation, providing users with a clear reference of what the system is looking for. The rotation helps users understand how to present the object for optimal detection.

With these components in place, our application now provides a complete object detection experience. The system uses our trained model to recognize diet soda cans, creates precise visual indicators in real-time, and provides clear user guidance through the informational interface.

Conclusion

Our Final App — Image By Author

In this tutorial, we’ve built a complete object detection system for visionOS that showcases the integration of several powerful technologies. Starting from 3D object capture, through ML model training in Create ML, to real-time detection using ARKit and RealityKit, we’ve created an app that seamlessly detects and tracks objects in the user’s space.

This implementation represents just the beginning of what’s possible with on-device machine learning in spatial computing. As hardware continues to evolve with more powerful Neural Engines and dedicated ML accelerators and frameworks like Core ML mature, we’ll see increasingly sophisticated applications that can understand and interact with our physical world in real-time. The combination of spatial computing and on-device ML opens up possibilities for applications ranging from advanced AR experiences to intelligent environmental understanding, all while maintaining user privacy and low latency.

The post On-Device Machine Learning in Spatial Computing appeared first on Towards Data Science.

]]>
Roadmap to Becoming a Data Scientist, Part 4: Advanced Machine Learning https://towardsdatascience.com/roadmap-to-becoming-a-data-scientist-part-4-advanced-machine-learning/ Fri, 14 Feb 2025 17:00:00 +0000 https://towardsdatascience.com/?p=597896 Introduction Data science is undoubtedly one of the most fascinating fields today. Following significant breakthroughs in machine learning about a decade ago, data science has surged in popularity within the tech community. Each year, we witness increasingly powerful tools that once seemed unimaginable. Innovations such as the Transformer architecture, ChatGPT, the Retrieval-Augmented Generation (RAG) framework, and state-of-the-art computer vision models — including GANs — have […]

The post Roadmap to Becoming a Data Scientist, Part 4: Advanced Machine Learning appeared first on Towards Data Science.

]]>
Introduction

Data science is undoubtedly one of the most fascinating fields today. Following significant breakthroughs in machine learning about a decade ago, data science has surged in popularity within the tech community. Each year, we witness increasingly powerful tools that once seemed unimaginable. Innovations such as the Transformer architecture, ChatGPT, the Retrieval-Augmented Generation (RAG) framework, and state-of-the-art Computer Vision models — including GANs — have had a profound impact on our world.

However, with the abundance of tools and the ongoing hype surrounding AI, it can be overwhelming — especially for beginners — to determine which skills to prioritize when aiming for a career in data science. Moreover, this field is highly demanding, requiring substantial dedication and perseverance.

The first three parts of this series outlined the necessary skills to become a data scientist in three key areas: math, software engineering, and machine learning. While knowledge of classical Machine Learning and neural network algorithms is an excellent starting point for aspiring data specialists, there are still many important topics in machine learning that must be mastered to work on more advanced projects.

This article will focus on the advanced machine learning skills necessary to progress in a Data Science career. Whether pursuing this path is a worthwhile choice based on your background and other factors will be discussed in a separate article.

The importance of learning the evolution of methods in machine learning

The section below provides information about the evolution of methods in natural language processing (NLP).

In contrast to previous articles in this series, I have decided to change the format in which I present the necessary skills for aspiring data scientists. Instead of directly listing specific competencies to develop and the motivation behind mastering them, I will briefly outline the most important approaches, presenting them in chronological order as they have been developed and used over the past decades in machine learning.

The reason is that I believe it is crucial to study these algorithms from the very beginning. In machine learning, many new methods are built upon older approaches, which is especially true for NLP and computer vision.

For example, jumping directly into the implementation details of modern large language models (LLMs) without any preliminary knowledge may make it very difficult for beginners to grasp the motivation and underlying ideas of specific mechanisms.

Given this, in the next two sections, I will highlight in bold the key concepts that should be studied.

# 04. NLP

Natural language processing (NLP) is a broad field that focuses on processing textual information. Machine learning algorithms cannot work directly with raw text, which is why text is usually preprocessed and converted into numerical vectors that are then fed into neural networks.

Before being converted into vectors, words undergo preprocessing, which includes simple techniques such as parsing, stemming, lemmatization, normalization, or removing stop words. After preprocessing, the resulting text is encoded into tokens. Tokens represent the smallest textual elements in a collection of documents. Generally, a token can be a part of a word, a sequence of symbols, or an individual symbol. Ultimately, tokens are converted into numerical vectors.

NLP roadmap

The bag of words method is the most basic way to encode tokens, focusing on counting the frequency of tokens in each document. However, in practice, this is usually not sufficient, as it is also necessary to account for token importance — a concept introduced in the TF-IDF and BM25 methods. While these methods improve upon the naive counting approach of bag of words, researchers have since developed a completely new approach called embeddings.
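To make this concrete, here is a minimal scikit-learn sketch (the toy sentences are invented for illustration) that encodes a tiny corpus with both raw-count bag of words and TF-IDF:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus used purely for illustration
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Bag of words: raw token counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())

# TF-IDF: the same counts reweighted by how informative each token is
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))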

Embeddings are numerical vectors whose components preserve the semantic meanings of words. Because of this, embeddings play a crucial role in NLP, enabling input data to be trained or used for model inference. Additionally, embeddings can be used to compare text similarity, allowing for the retrieval of the most relevant documents from a collection.

Embeddings can also be used to encode other unstructured data, including images, audio, and videos.

As a field, NLP has been evolving rapidly over the last 10–20 years to efficiently solve various text-related problems. Complex tasks like text translation and text generation were initially addressed using recurrent neural networks (RNNs), which introduced the concept of memory, allowing neural networks to capture and retain key contextual information in long documents.

Although RNN performance gradually improved, it remained suboptimal for certain tasks. Moreover, RNNs are relatively slow, and their sequential prediction process does not allow for parallelization during training and inference, making them less efficient.

These limitations were eventually addressed by the Transformer architecture, which relies on attention and processes tokens in parallel. The original Transformer can be decomposed into two separate modules: BERT and GPT. Both of these form the foundation of the most state-of-the-art models used today to solve various NLP problems. Understanding their principles is valuable knowledge that will help learners advance further when studying or working with other large language models (LLMs).

Transformer architecture

When it comes to LLMs, I strongly recommend studying the evolution of at least the first three GPT models, as they have had a significant impact on the AI world we know today. In particular, I would like to highlight the concepts of few-shot and zero-shot learning, introduced in GPT-2, which enable LLMs to solve text generation tasks without explicitly receiving any training examples for them.

Another important technique developed in recent years is retrieval-augmented generation (RAG). The main limitation of LLMs is that they are only aware of the context used during their training. As a result, they lack knowledge of any information beyond their training data.

Example of a RAG pipeline

The retriever converts the input prompt into an embedding, which is then used to query a vector database. The database returns the most relevant context based on the similarity to the embedding. This retrieved context is then combined with the original prompt and passed to a generative model. The model processes both the initial prompt and the additional context to generate a more informed and contextually accurate response.

A good example of this limitation is the first version of the ChatGPT model, which was trained on data up to the year 2022 and had no knowledge of events that occurred from 2023 onward.

To address this limitation, OpenAI researchers developed a RAG pipeline, which includes a constantly updated database containing new information from external sources. When ChatGPT is given a task that requires external knowledge, it queries the database to retrieve the most relevant context and integrates it into the final prompt sent to the machine learning model.
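As a rough, self-contained illustration of that flow, the sketch below fakes the whole pipeline with numpy: the document embeddings are random vectors, the retriever is a plain cosine-similarity scan, and the "generator" is just a string template. A real pipeline would use a proper embedding model, a vector database, and an LLM.

import numpy as np

# Toy vector database: 3 documents with made-up 4-dimensional embeddings
docs = ["doc about GPT-4", "doc about RAG", "doc about CNNs"]
doc_embeddings = np.random.rand(3, 4)

def retrieve(query_embedding, k=1):
    # Cosine similarity between the query and every stored document
    sims = doc_embeddings @ query_embedding
    sims /= np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    top = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in top]

query_embedding = np.random.rand(4)          # would come from an embedding model
context = retrieve(query_embedding)
prompt = f"Context: {context}\nQuestion: <user question>\nAnswer:"
print(prompt)                                # this augmented prompt goes to the LLM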

In the modern era, LLM development has led to models with millions or even billions of parameters. As a consequence, the overall size of these models may exceed the hardware limitations of standard computers or small portable devices, which come with many constraints.

This is where optimization techniques become particularly useful, allowing LLMs to be compressed without significantly compromising their performance. The most commonly used techniques today include distillation, quantization, and pruning.

The goal of distillation is to create a smaller model that can imitate a larger one. In practice, this means that if a large model makes a prediction, the smaller model is expected to produce a similar result.

Quantization is the process of reducing the memory required to store numerical values representing a model’s weights.

Pruning refers to discarding the least important weights of a model.
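To give a feel for the quantization idea, here is a toy numpy sketch that maps float32 weights to 8-bit integers and back; real schemes (per-channel scales, zero-points, 4-bit formats) are more elaborate.

import numpy as np

weights = np.random.randn(5).astype(np.float32)        # original float32 weights

scale = np.abs(weights).max() / 127                    # map the value range onto int8
q_weights = np.round(weights / scale).astype(np.int8)  # 4x less memory than float32
deq_weights = q_weights.astype(np.float32) * scale     # approximate reconstruction

print(weights)
print(deq_weights)   # close to the originals, at a fraction of the storage cost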

Fine-tuning

Regardless of the area in which you wish to specialize, knowledge of fine-tuning is a must-have skill! Fine-tuning is a powerful concept that allows you to efficiently adapt a pre-trained model to a new task.

Fine-tuning is especially useful when working with very large models. For example, imagine you want to use BERT to perform semantic analysis on a specific dataset. While BERT is trained on general data, it might not fully understand the context of your dataset. At the same time, training BERT from scratch for your specific task would require a massive amount of resources.

Here is where fine-tuning comes in: it involves taking a pre-trained BERT (or another model) and freezing some of its layers (usually those at the beginning). As a result, BERT is retrained, but this time only on the new dataset provided. Since BERT updates only a subset of its weights and the new dataset is likely much smaller than the original one BERT was trained on, fine-tuning becomes a very efficient technique for adapting BERT’s rich knowledge to a specific domain.
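Below is a minimal sketch of this recipe using the Hugging Face transformers library; the model name, the number of labels, and the choice of which layers to unfreeze are all illustrative.

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the pretrained BERT body
for param in model.bert.parameters():
    param.requires_grad = False

# Optionally unfreeze the last two encoder layers for deeper adaptation
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

# The classification head (model.classifier) remains trainable by default,
# so training now only updates a small fraction of the weights.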

Fine-tuning is widely used not only in NLP but also across many other domains.

# 05. Computer vision

As the name suggests, computer vision (CV) involves analyzing images and videos using machine learning. The most common tasks include image classification, object detection, image segmentation, and generation.

Most CV algorithms are based on neural networks, so it is essential to understand how they work in detail. In particular, CV uses a special type of network called convolutional neural networks (CNNs). These are similar to fully connected networks, except that they typically begin with a set of specialized mathematical operations called convolutions.

Computer vision roadmap

In simple terms, convolutions act as filters, enabling the model to extract the most important features from an image, which are then passed to fully connected layers for further analysis.
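A minimal PyTorch sketch of that idea: a single convolutional layer scanning an image-sized tensor and producing feature maps, followed by pooling. The layer sizes are arbitrary.

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # one RGB image

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

features = pool(conv(image))                 # 16 feature maps, spatially downsampled
print(features.shape)                        # torch.Size([1, 16, 112, 112])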

The next step is to study the most popular CNN architectures for classification tasks, such as AlexNet, VGG, Inception, and ResNet, all of which are typically trained and benchmarked on the ImageNet dataset.

Speaking of the object detection task, the YOLO algorithm is a clear winner. It is not necessary to study all of the dozens of versions of YOLO. In reality, going through the original paper of the first YOLO should be sufficient to understand how a relatively difficult problem like object detection is elegantly transformed into both classification and regression problems. This approach in YOLO also provides a nice intuition on how more complex CV tasks can be reformulated in simpler terms.

While there are many architectures for performing image segmentation, I would strongly recommend learning about UNet, which introduces an encoder-decoder architecture.

Finally, image generation is probably one of the most challenging tasks in CV. Personally, I consider it an optional topic for learners, as it involves many advanced concepts. Nevertheless, gaining a high-level intuition of how generative adversarial networks (GANs) function to generate images is a good way to broaden one’s horizons.

In some problems, the training data might not be enough to build a performant model. In such cases, the data augmentation technique is commonly used. It involves the artificial generation of training data from already existing data (images). By feeding the model more diverse data, it becomes capable of learning and recognizing more patterns.

# 06. Other areas

It would be very hard to present in detail the Roadmaps for all existing machine learning domains in a single article. That is why, in this section, I would like to briefly list and explain some of the other most popular areas in data science worth exploring.

First of all, recommender systems (RecSys) have gained a lot of popularity in recent years. They are increasingly implemented in online shops, social networks, and streaming services. The key idea of most algorithms is to take a large initial matrix of all users and items and decompose it into a product of several matrices in a way that associates every user and every item with a high-dimensional embedding. This approach is very flexible, as it then allows different types of comparison operations on embeddings to find the most relevant items for a given user. Moreover, it is much faster to perform analysis on these small matrices than on the original one, which usually has huge dimensions.

Matrix decomposition in recommender systems is one of the most commonly used methods
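As a rough sketch of this decomposition, the code below factorizes a tiny made-up user-item rating matrix with a truncated SVD, so that every user and item receives a small embedding whose dot products approximate the original ratings:

import numpy as np

# Made-up ratings: 4 users x 5 items
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 0, 0, 1],
    [0, 1, 5, 4, 4],
    [1, 0, 4, 5, 5],
], dtype=float)

U, S, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2                                    # embedding dimension
user_emb = U[:, :k] * S[:k]              # one 2-d embedding per user
item_emb = Vt[:k, :].T                   # one 2-d embedding per item

approx = user_emb @ item_emb.T           # reconstructed rating matrix
print(approx.round(1))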

Ranking often goes hand in hand with RecSys. When a RecSys has identified a set of the most relevant items for the user, ranking algorithms are used to sort them to determine the order in which they will be shown or proposed to the user. A good example of their usage is search engines, which filter query results from top to bottom on a web page.

Closely related to ranking, there is also a matching problem that aims to optimally map objects from two sets, A and B, in a way that, on average, every object pair (a, b) is mapped “well” according to a matching criterion. A use case example might include distributing a group of students to different university disciplines, where the number of spots in each class is limited.

Clustering is an unsupervised machine learning task whose objective is to split a dataset into several regions (clusters), with each dataset object belonging to one of these clusters. The splitting criteria can vary depending on the task. Clustering is useful because it allows for grouping similar objects together. Moreover, further analysis can be applied to treat objects in each cluster separately.

The goal of clustering is to group dataset objects (on the left) into several categories (on the right) based on their similarity.
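A minimal scikit-learn sketch of clustering, grouping synthetic 2-d points into three clusters (the data and the number of clusters are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

points = np.random.rand(100, 2)              # synthetic 2-d dataset

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:10])                   # cluster index assigned to each point
print(kmeans.cluster_centers_)               # coordinates of the three cluster centers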

Dimensionality reduction is another unsupervised problem, where the goal is to compress an input dataset. When the dimensionality of the dataset is large, it takes more time and resources for machine learning algorithms to analyze it. By identifying and removing noisy dataset features or those that do not provide much valuable information, the data analysis process becomes considerably easier.
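Similarly, here is a small sketch of dimensionality reduction with PCA, compressing random 50-dimensional data down to 5 components while reporting how much variance is preserved:

import numpy as np
from sklearn.decomposition import PCA

data = np.random.rand(200, 50)               # 200 samples, 50 noisy features

pca = PCA(n_components=5)
compressed = pca.fit_transform(data)

print(compressed.shape)                      # (200, 5)
print(pca.explained_variance_ratio_.sum())   # fraction of variance preserved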

Similarity search is an area that focuses on designing algorithms and data structures (indexes) to optimize searches in a large database of embeddings (vector database). More precisely, given an input embedding and a vector database, the goal is to approximately find the most similar embedding in the database relative to the input embedding.

The goal of similarity search is to approximately find the most similar embedding in a vector database relative to a query embedding.

The word “approximately” means that the search is not guaranteed to be 100% precise. Nevertheless, this is the main idea behind similarity search algorithms — sacrificing a bit of accuracy in exchange for significant gains in prediction speed or data compression.
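The brute-force version of this search is easy to write down; the sketch below scores a query against every stored embedding with cosine similarity and returns the top matches. Dedicated libraries such as FAISS or Annoy replace this exhaustive scan with approximate indexes to make it scale.

import numpy as np

database = np.random.rand(10_000, 128)       # vector database of 128-d embeddings
query = np.random.rand(128)

# Cosine similarity between the query and every stored embedding
sims = database @ query / (np.linalg.norm(database, axis=1) * np.linalg.norm(query))

top_k = np.argsort(sims)[::-1][:5]           # indexes of the 5 most similar embeddings
print(top_k, sims[top_k])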

Time series analysis involves studying the behavior of a target variable over time. This problem can be solved using classical tabular algorithms. However, the presence of time introduces new factors that cannot be captured by standard algorithms. For instance:

  • the target variable can have an overall trend, where in the long term its values increase or decrease (e.g., the average yearly temperature rising due to global warming).
  • the target variable can have a seasonality which makes its values change based on the currently given period (e.g. temperature is lower in winter and higher in summer).

Most time series models take both of these factors into account. In general, time series models are widely used in financial, stock, or demographic analysis.

Time series data is often decomposed into several components, which include trend and seasonality.
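As a small illustration, the sketch below decomposes a synthetic monthly series (an invented upward trend plus a yearly cycle plus noise) into trend and seasonal components using statsmodels:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: trend + yearly seasonality + noise
index = pd.date_range("2015-01-01", periods=120, freq="MS")
values = np.arange(120) * 0.5 + 10 * np.sin(2 * np.pi * np.arange(120) / 12) + np.random.randn(120)
series = pd.Series(values, index=index)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))              # the repeating yearly pattern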

Another advanced area I would recommend exploring is reinforcement learning, which fundamentally changes the algorithm design compared to classical machine learning. In simple terms, its goal is to train an agent in an environment to make optimal decisions based on a reward system (also known as the “trial and error approach”). By taking an action, the agent receives a reward, which helps it understand whether the chosen action had a positive or negative effect. After that, the agent slightly adjusts its strategy, and the entire cycle repeats.

Reinforcement learning framework. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

Reinforcement learning is particularly popular in complex environments where classical algorithms are not capable of solving a problem. Given the complexity of reinforcement learning algorithms and the computational resources they require, this area is not yet fully mature, but it has high potential to gain even more popularity in the future.
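To make the trial-and-error loop concrete, here is a tiny tabular Q-learning sketch on an invented one-dimensional world where the agent walks left or right toward a rewarding final state. It only demonstrates the core update rule, not a full training setup.

import numpy as np

n_states, n_actions = 5, 2                   # toy 1-d world, actions: 0=left, 1=right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(200):
    state = 0
    while state != n_states - 1:             # last state is the goal
        # epsilon-greedy action selection (trial and error)
        action = np.random.randint(n_actions) if np.random.rand() < epsilon else Q[state].argmax()
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # core Q-learning update: adjust the strategy using the received reward
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.round(2))                            # "move right" should dominate in every state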

Main applications of reinforcement learning

Currently the most popular applications are:

  • Games. Existing approaches can design optimal game strategies and outperform humans. The most well-known examples are chess and Go.
  • Robotics. Advanced algorithms can be incorporated into robots to help them move, carry objects or complete routine tasks at home.
  • Autopilot. Reinforcement learning methods can be developed to automatically drive cars, control helicopters or drones.

Conclusion

This article was a logical continuation of the previous part and expanded the skill set needed to become a data scientist. While most of the mentioned topics require time to master, they can add significant value to your portfolio. This is especially true for the NLP and CV domains, which are in high demand today.

After reaching a high level of expertise in data science, it is still crucial to stay motivated and consistently push yourself to learn new topics and explore emerging algorithms.

Data science is a constantly evolving field, and in the coming years, we might witness the development of new state-of-the-art approaches that we could not have imagined in the past.

Resources

All images are by the author unless noted otherwise.

The post Roadmap to Becoming a Data Scientist, Part 4: Advanced Machine Learning appeared first on Towards Data Science.

]]>
Show and Tell https://towardsdatascience.com/show-and-tell-e1a1142456e2/ Mon, 03 Feb 2025 16:30:24 +0000 https://towardsdatascience.com/show-and-tell-e1a1142456e2/ Implementing one of the earliest neural image caption generator models with PyTorch.

The post Show and Tell appeared first on Towards Data Science.

]]>
Photo by Ståle Grut on Unsplash

Introduction

Natural Language Processing and Computer Vision used to be two completely different fields. Well, at least back when I started to learn machine learning and deep learning, I felt like there were multiple paths to follow, and each of them, including NLP and Computer Vision, directed me to a completely different world. Over time, we can now observe that AI has become more and more advanced, with the intersection between multiple fields of study becoming more common, including the two I just mentioned.

Today, many language models have the capability to generate images based on a given prompt. That’s one example of the bridge between NLP and Computer Vision. But I guess I’ll save it for my upcoming article as it is a bit more complex. Instead, in this article I am going to discuss the simpler one: image captioning. As the name suggests, this is essentially a technique where a specific model accepts an image and returns a text that describes the input image.

One of the earliest papers in this topic is the one titled "Show and Tell: A Neural Image Caption Generator" written by Vinyals et al. back in 2015 [1]. In this article, I will focus on implementing the Deep Learning model proposed in the paper using PyTorch. Note that I won’t actually demonstrate the training process here as that’s a topic on its own. Let me know in the comments if you want a separate tutorial on that.


Image Captioning Framework

Generally speaking, image captioning can be done by combining two types of models: the one specialized to process images and another one capable of processing sequences. I believe you already know what kind of models work best for the two tasks – yes, you’re right, those are CNN and RNN, respectively. The idea here is that the CNN is utilized to encode the input image (hence this part is called encoder), whereas the RNN is used for generating a sequence of words based on the features encoded by the CNN (hence the RNN part is called decoder).

It is discussed in the paper that the authors attempted to do so using GoogLeNet (a.k.a., Inception V1) for the encoder and LSTM for the decoder. In fact, the use of GoogLeNet is not explicitly mentioned, yet based on the illustration provided in the paper it seems like the architecture used in the encoder is adopted from the original GoogLeNet paper [2]. The figure below shows what the proposed architecture looks like.

Figure 1. The image captioning model proposed in [1], where the encoder part (the leftmost block) implements the GoogLeNet model [2].

Talking more specifically about the connection between the encoder and the decoder, there are several methods available for connecting the two, namely init-inject, pre-inject, par-inject and merge, as mentioned in [3]. In the case of the Show and Tell paper, authors used pre-inject, a method where the features extracted by the encoder are perceived as the 0th word in the caption. Later in the inference phase, we expect the decoder to generate a caption based solely on these image features.

Figure 2. The four methods possible to be used to connect the encoder and the decoder part of an image captioning model [3]. In our case we are going to use the pre-inject method (b).

As we already understood the theory behind the image captioning model, we can now jump into the code!


Implementation

I’ll break the implementation part into three sections: the Encoder, the Decoder, and the combination of the two. Before we actually get into them, we need to import the modules and initialize the required parameters in advance. Look at the Codeblock 1 below to see the modules I use.

# Codeblock 1
import torch  #(1)
import torch.nn as nn  #(2)
import torchvision.models as models  #(3)
from torchvision.models import GoogLeNet_Weights  #(4)

Let’s break down these imports quickly: the line marked with #(1) is used for basic operations, line #(2) is for initializing neural network layers, line #(3) is for loading various deep learning models, and #(4) is the pretrained weights for the GoogLeNet model.

Talking about the parameter configuration, EMBED_DIM and LSTM_HIDDEN_DIM are the only two parameters mentioned in the paper, which are both set to 512 as shown at line #(1) and #(2) in the Codeblock 2 below. The EMBED_DIM variable essentially indicates the feature vector size representing a single token in the caption. In this case, we can simply think of a single token as an individual word. Meanwhile, LSTM_HIDDEN_DIM is a variable representing the hidden state size inside the LSTM cell. This paper does not mention how many times this RNN-based layer is repeated, but based on the diagram in Figure 1, it seems like it only implements a single LSTM cell. Thus, at line #(3) I set the NUM_LSTM_LAYERS variable to 1.

# Codeblock 2
EMBED_DIM       = 512    #(1)
LSTM_HIDDEN_DIM = 512    #(2)
NUM_LSTM_LAYERS = 1      #(3)

IMAGE_SIZE      = 224    #(4)
IN_CHANNELS     = 3      #(5)

SEQ_LENGTH      = 30     #(6)
VOCAB_SIZE      = 10000  #(7)

BATCH_SIZE      = 1

The next two parameters are related to the input image, namely IMAGE_SIZE (#(4)) and IN_CHANNELS (#(5)). Since we are about to use GoogLeNet for the encoder, we need to match it with its original input shape (3×224×224). Not only for the image, but we also need to configure the parameters for the caption. Here we assume that the caption length is no more than 30 words (#(6)) and the number of unique words in the dictionary is 10000 (#(7)). Lastly, the BATCH_SIZE parameter is used because by default PyTorch processes tensors in a batch. Just to make things simple, the number of image-caption pair within a single batch is set to 1.

GoogLeNet Encoder

It is actually possible to use any kind of CNN-based model for the encoder. I found on the internet that [4] uses DenseNet, [5] uses Inception V3, and [6] utilizes ResNet for the similar tasks. However, since my goal is to reproduce the model proposed in the paper as closely as possible, I am using the pretrained GoogLeNet model instead. Before we get into the encoder implementation, let’s see what the GoogLeNet architecture looks like using the following code.

# Codeblock 3
models.googlenet()

The resulting output is very long as it lists literally all layers inside the architecture. Here I truncate the output since I only want you to focus on the last layer (the fc layer marked with #(1) in the Codeblock 3 Output below). You can see that this linear layer maps a feature vector of size 1024 into 1000. Normally, in a standard image classification task, each of these 1000 neurons corresponds to a specific class. So, for example, if you want to perform a 5-class classification task, you would need to modify this layer such that it projects the outputs to 5 neurons only. In our case, we need to make this layer produce a feature vector of length 512 (EMBED_DIM). With this, the input image will later be represented as a 512-dimensional vector after being processed by the GoogLeNet model. This feature vector size will exactly match with the token embedding dimension, allowing it to be treated as a part of our word sequence.

# Codeblock 3 Output
GoogLeNet(
  (conv1): BasicConv2d(
    (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (maxpool1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)
  (conv2): BasicConv2d(
    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )

  .
  .
  .
  .

  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=1000, bias=True)  #(1)
)

Now let’s actually load and modify the GoogLeNet model, which I do in the InceptionEncoder class below.

# Codeblock 4a
class InceptionEncoder(nn.Module):
    def __init__(self, fine_tune):  #(1)
        super().__init__()
        self.googlenet = models.googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)  #(2)
        self.googlenet.fc = nn.Linear(in_features=self.googlenet.fc.in_features,  #(3)
                                      out_features=EMBED_DIM)  #(4)

        if fine_tune == True:       #(5)
            for param in self.googlenet.parameters():
                param.requires_grad = True
        else:
            for param in self.googlenet.parameters():
                param.requires_grad = False

        for param in self.googlenet.fc.parameters():
            param.requires_grad = True

The first thing we do in the above code is to load the model using models.googlenet(). It is mentioned in the paper that the model is already pretrained on the ImageNet dataset. Thus, we need to pass GoogLeNet_Weights.IMAGENET1K_V1 into the weights parameter, as shown at line #(2) in Codeblock 4a. Next, at line #(3) we access the classification head through the fc attribute, where we replace the existing linear layer with a new one having the output dimension of 512 (EMBED_DIM) (#(4)). Since this GoogLeNet model is already trained, we don’t need to train it from scratch. Instead, we can either perform fine-tuning or transfer learning in order to adapt it to the image captioning task.

In case you’re not yet familiar with the two terms, fine-tuning is a method where we update the weights of the entire model. On the other hand, transfer learning is a technique where we only update the weights of the layers we replaced (in this case it’s the last fully-connected layer), while setting the weights of the existing layers non-trainable. To do so, I implement a flag named fine_tune at line #(1) which will let the model to perform fine-tuning whenever it is set to True (#(5)).

The forward() method is pretty straightforward since what we do here is simply passing the input image through the modified GoogLeNet model. See the Codeblock 4b below for the details. Additionally, here I also print out the tensor dimension before and after processing so that you can better understand how the InceptionEncoder model works.

# Codeblock 4b
    def forward(self, images):
        print(f'original\t: {images.size()}')
        features = self.googlenet(images)
        print(f'after googlenet\t: {features.size()}')

        return features

To test whether our encoder works properly, we can pass a dummy tensor of size 1×3×224×224 through the network as demonstrated in Codeblock 5. This tensor dimension simulates a single RGB image of size 224×224. You can see in the resulting output that our image now becomes a single-dimensional feature vector with the length of 512.

# Codeblock 5
inception_encoder = InceptionEncoder(fine_tune=True)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = inception_encoder(images)
# Codeblock 5 Output
original         : torch.Size([1, 3, 224, 224])
after googlenet  : torch.Size([1, 512])

LSTM Decoder

As we have successfully implemented the encoder, we are now going to create the LSTM decoder, which I demonstrate in Codeblock 6a and 6b. What we need to do first is to initialize the required layers, namely an embedding layer (#(1)), the LSTM layer itself (#(2)), and a standard linear layer (#(3)). The first one (nn.Embedding) is responsible for mapping every single token into a 512 (EMBED_DIM)-dimensional vector. Meanwhile, the LSTM layer is going to generate a sequence of embedded tokens, where each of these tokens will be mapped into a 10000 (VOCAB_SIZE)-dimensional vector by the linear layer. Later on, the values contained in this vector will represent the likelihood of each word in the dictionary being chosen.

# Codeblock 6a
class LSTMDecoder(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)
        #(2)
        self.lstm = nn.LSTM(input_size=EMBED_DIM, 
                            hidden_size=LSTM_HIDDEN_DIM, 
                            num_layers=NUM_LSTM_LAYERS, 
                            batch_first=True)
        #(3)        
        self.linear = nn.Linear(in_features=LSTM_HIDDEN_DIM, 
                                out_features=VOCAB_SIZE)

Next, let’s define the flow of the network using the following code.

# Codeblock 6b
    def forward(self, features, captions):                 #(1)
        print(f'features original\t: {features.size()}')
        features = features.unsqueeze(1)                   #(2)
        print(f"after unsqueeze\t\t: {features.shape}")

        print(f'captions original\t: {captions.size()}')
        captions = self.embedding(captions)                #(3)
        print(f"after embedding\t\t: {captions.shape}")

        captions = torch.cat([features, captions], dim=1)  #(4)
        print(f"after concat\t\t: {captions.shape}")

        captions, _ = self.lstm(captions)                  #(5)
        print(f"after lstm\t\t: {captions.shape}")

        captions = self.linear(captions)                   #(6)
        print(f"after linear\t\t: {captions.shape}")

        return captions

You can see in the above code that the forward() method of the LSTMDecoder class accepts two inputs: features and captions, where the former is the image that has been processed by the InceptionEncoder, while the latter is the caption of the corresponding image serving as the ground truth (#(1)). The idea here is that we are going to perform pre-inject operation by prepending the features tensor into captions using the code at line #(4). However, keep in mind that we need to adjust the shape of both tensors beforehand. To do so, we have to insert a single dimension at the 1st axis of the image features (#(2)). Meanwhile, the shape of the captions tensor will align with our requirement right after being processed by the embedding layer (#(3)). As the features and captions have been concatenated, we then pass this tensor through the LSTM layer (#(5)) before it is eventually processed by the linear layer (#(6)). Look at the testing code below to better understand the flow of the two tensors.

# Codeblock 7
lstm_decoder = LSTMDecoder()

features = torch.randn(BATCH_SIZE, EMBED_DIM)  #(1)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(2)

captions = lstm_decoder(features, captions)

In Codeblock 7, I assume that features is a dummy tensor that represents the output of the InceptionEncoder model (#(1)). Meanwhile, captions is the tensor representing a sequence of tokenized words, where in this case I initialize it as random numbers ranging between 0 to 10000 (VOCAB_SIZE) with the length of 30 (SEQ_LENGTH) (#(2)).

We can see in the output below that the features tensor initially has the dimension of 1×512 (#(1)). This tensor shape changed to 1×1×512 after being processed with the unsqueeze() operation (#(2)). The additional dimension in the middle (1) allows the tensor to be treated as a feature vector corresponding to a single timestep, which is necessary for compatibility with the LSTM layer. As for the captions tensor, its shape changed from 1×30 (#(3)) to 1×30×512 (#(4)), indicating that every single word is now represented as a 512-dimensional vector.

# Codeblock 7 Output
features original : torch.Size([1, 512])       #(1)
after unsqueeze   : torch.Size([1, 1, 512])    #(2)
captions original : torch.Size([1, 30])        #(3)
after embedding   : torch.Size([1, 30, 512])   #(4)
after concat      : torch.Size([1, 31, 512])   #(5)
after lstm        : torch.Size([1, 31, 512])   #(6)
after linear      : torch.Size([1, 31, 10000]) #(7)

After pre-inject operation is performed, our tensor is now having the dimension of 1×31×512, where the features tensor becomes the token at the 0th timestep in the sequence (#(5)). See the following figure to better illustrate this idea.

Figure 3. What the resulting tensor looks like after the pre-injection operation. [3].

Next, we pass the tensor through the LSTM layer, which in this particular case keeps the output tensor dimension the same. However, it is important to note that the tensor shapes at line #(5) and #(6) in the above output are actually specified by different parameters. The dimensions appear to match here because EMBED_DIM and LSTM_HIDDEN_DIM were both set to 512. Normally, if we use a different value for LSTM_HIDDEN_DIM, then the output dimension is going to be different as well. Finally, we project each of the 31 token embeddings to a vector of size 10000, which will later contain the probability of every possible token being predicted (#(7)).

GoogLeNet Encoder + LSTM Decoder

At this point, we have successfully created both the encoder and the decoder parts of the image captioning model. What I am going to do next is to combine them together in the ShowAndTell class below.

# Codeblock 8a
class ShowAndTell(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = InceptionEncoder(fine_tune=True)  #(1)
        self.decoder = LSTMDecoder()     #(2)

    def forward(self, images, captions):
        features = self.encoder(images)  #(3)
        print(f"after encodert: {features.shape}")

        captions = self.decoder(features, captions)      #(4)
        print(f"after decodert: {captions.shape}")

        return captions

I think the above code is pretty straightforward. In the __init__() method, we only need to initialize the InceptionEncoder as well as the LSTMDecoder models (#(1) and #(2)). Here I assume that we are about to perform fine-tuning rather than transfer learning, so I set the fine_tune parameter to True. Theoretically speaking, fine-tuning is better than transfer learning if you have a relatively large dataset since it works by re-adjusting the weights of the entire model. However, if your dataset is rather small, you should go with transfer learning instead – but that’s just the theory. It’s definitely a good idea to experiment with both options to see which works best in your case.

Still with the above codeblock, we configure the forward() method to accept image-caption pairs as input. With this configuration, we basically design this method such that it can only be used for training purpose. Here we initially process the raw image with the GoogLeNet inside the encoder block (#(3)). Afterwards, we pass the extracted features as well as the tokenized captions into the decoder block and let it produce another token sequence (#(4)). In the actual training, this caption output will then be compared with the ground truth to compute the error. This error value is going to be used to compute gradients through backpropagation, which determines how the weights in the network are updated.

It is important to know that we cannot use the forward() method to perform inference, so we need a separate one for that. In this case, I am going to implement the code specifically to perform inference in the generate() method below.

# Codeblock 8b
    def generate(self, images):  #(1)
        features = self.encoder(images)              #(2)
        print(f"after encoder\t\t: {features.shape}\n")

        words = []  #(3)
        for i in range(SEQ_LENGTH):                  #(4)
            print(f"iteration #{i}")
            features = features.unsqueeze(1)
            print(f"after unsqueeze\t\t: {features.shape}")

            features, _ = self.decoder.lstm(features)
            print(f"after lstm\t\t: {features.shape}")

            features = features.squeeze(1)           #(5)
            print(f"after squeeze\t\t: {features.shape}")

            probs = self.decoder.linear(features)    #(6)
            print(f"after linear\t\t: {probs.shape}")

            _, word = probs.max(dim=1)  #(7)
            print(f"after max\t\t: {word.shape}")

            words.append(word.item())  #(8)

            if word == 1:  #(9)
                break

            features = self.decoder.embedding(word)  #(10)
            print(f"after embedding\t\t: {features.shape}\n")

        return words       #(11)

Instead of taking two inputs like the previous one, the generate() method takes raw image as the only input (#(1)). Since we want the features extracted from the image to be the initial input token, we first need to process the raw input image with the encoder block prior to actually generating the subsequent tokens (#(2)). Next, we allocate an empty list for storing the token sequence to be produced later (#(3)). The tokens themselves are generated one by one, so we wrap the entire process inside a for loop, which is going to stop iterating once it reaches at most 30 (SEQ_LENGTH) words (#(4)).

The steps done inside the loop are algorithmically similar to the ones we discussed earlier. However, since the LSTM cell here generates a single token at a time, the process requires the tensor to be treated a bit differently from the one passed through the forward() method of the LSTMDecoder class back in Codeblock 6b. The first difference you might notice is the squeeze() operation (#(5)), which is basically just a technical step to be done such that the subsequent layer does the linear projection correctly (#(6)). Then, we take the index of the feature vector having the highest value, which corresponds to the token most likely to come next (#(7)), and append it to the list we allocated earlier (#(8)). The loop breaks whenever the predicted index corresponds to the stop token (#(9)), which in this case I assume to be at the 1st index of the probs vector. Otherwise, if the model does not find the stop token, it is going to convert the last predicted word into its 512 (EMBED_DIM)-dimensional vector (#(10)), allowing it to be used as the input features for the next iteration. Lastly, the generated word sequence will be returned once the loop is completed (#(11)).

We are going to simulate the forward pass for the training phase using the Codeblock 9 below. Here I pass two tensors through the show_and_tell model (#(1)), each representing a raw image of size 3×224×224 (#(2)) and a sequence of tokenized words (#(3)). Based on the resulting output, we found that our model works properly as the two input tensors successfully passed through the InceptionEncoder and the LSTMDecoder part of the network.

# Codeblock 9
show_and_tell = ShowAndTell()  #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))      #(3)

captions = show_and_tell(images, captions)
# Codeblock 9 Output
after encoder : torch.Size([1, 512])
after decoder : torch.Size([1, 31, 10000])

Now, let’s assume that our show_and_tell model is already trained on an image captioning dataset, and thus ready to be used for inference. Look at the Codeblock 10 below to see how I do it. Here we set the model to eval() mode (#(1)), initialize the input image (#(2)), and pass it through the model using the generate() method (#(3)).

# Codeblock 10
show_and_tell.eval()  #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)

with torch.no_grad():
    generated_tokens = show_and_tell.generate(images)  #(3)

The flow of the tensor can be seen in the output below. Here I truncate the resulting output because it simply repeats the same token generation process 30 times.

# Codeblock 10 Output
after encoder    : torch.Size([1, 512])

iteration #0
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

iteration #1
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

.
.
.
.

To see what the resulting caption looks like, we can just print out the generated_tokens list as shown below. Keep in mind that this sequence is still in the form of tokenized words. Later, in the post-processing stage, we will need to convert them back to the words corresponding to these numbers.

# Codeblock 11
generated_tokens
# Codeblock 11 Output
[5627,
 3906,
 2370,
 2299,
 4952,
 9933,
 402,
 7775,
 602,
 4414,
 8667,
 6774,
 9345,
 8750,
 3680,
 4458,
 1677,
 5998,
 8572,
 9556,
 7347,
 6780,
 9672,
 2596,
 9218,
 1880,
 4396,
 6168,
 7999,
 454]
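As a glimpse of that post-processing step, the sketch below assumes a hypothetical idx2word dictionary (the inverse of the word-to-index vocabulary used during tokenization) and maps the generated indices back to words, stopping at the assumed stop token:

# Illustration only (not part of the original codeblocks).
# idx2word is a hypothetical mapping built from the training vocabulary;
# here it is filled with placeholder strings just to show the lookup.
idx2word = {idx: f"word_{idx}" for idx in range(VOCAB_SIZE)}
idx2word[1] = "<end>"                                  # assumed stop token

caption_words = [idx2word[token] for token in generated_tokens if token != 1]
caption = " ".join(caption_words)
print(caption)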

Ending

With the above output, we’ve reached the end of our discussion on image captioning. Over time, many other researchers attempted to make improvements to accomplish this task. So, I think in the upcoming article I will discuss the state-of-the-art method on this topic.

Thanks for reading, I hope you learn something new today!

_By the way you can also find the code used in this article here._


References

[1] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. Arxiv. https://arxiv.org/pdf/1411.4555 [Accessed November 13, 2024].

[2] Christian Szegedy et al. Going Deeper with Convolutions. Arxiv. https://arxiv.org/pdf/1409.4842 [Accessed November 13, 2024].

[3] Marc Tanti et al. Where to put the Image in an Image Caption Generator. Arxiv. https://arxiv.org/pdf/1703.09137 [Accessed November 13, 2024].

[4] Stepan Ulyanin. Captioning Images with CNN and RNN, using PyTorch. Medium. https://medium.com/@stepanulyanin/captioning-images-with-pytorch-bc592e5fd1a3 [Accessed November 16, 2024].

[5] Saketh Kotamraju. How to Build an Image-Captioning Model in Pytorch. Towards Data Science. https://towardsdatascience.com/how-to-build-an-image-captioning-model-in-pytorch-29b9d8fe2f8c [Accessed November 16, 2024].

[6] Code with Aarohi. Image Captioning using CNN and RNN | Image Captioning using Deep Learning. YouTube. https://www.youtube.com/watch?v=htNmFL2BG34 [Accessed November 16, 2024].

The post Show and Tell appeared first on Towards Data Science.

]]>
Extracting Structured Vehicle Data from Images https://towardsdatascience.com/extracting-structured-vehicle-data-from-images-794128aa8696/ Mon, 27 Jan 2025 17:38:45 +0000 https://towardsdatascience.com/extracting-structured-vehicle-data-from-images-794128aa8696/ Build an Automated Vehicle Documentation System that Extracts Structured Information from Images, using OpenAI API, LangChain and Pydantic.

The post Extracting Structured Vehicle Data from Images appeared first on Towards Data Science.

]]>
Image was generated by author on PicLumen

Introduction

Imagine there is a camera monitoring cars at an inspection point, and your mission is to document complex vehicle details – type, license plate number, make, model and color. The task is challenging – classic Computer Vision methods struggle with varied patterns, while supervised deep learning requires integrating multiple specialized models, extensive labeled data, and tedious training. Recent advancements in the pre-trained Multimodal LLMs (MLLMs) field offer fast and flexible solutions, but adapting them for structured outputs requires adjustments.

In this tutorial, we’ll build a vehicle documentation system that extracts essential details from vehicle images. These details will be extracted in a structured format, making them accessible for further downstream use. We’ll use OpenAI’s GPT-4 to extract the data, Pydantic to structure the outputs, and LangChain to orchestrate the pipeline. By the end, you’ll have a practical pipeline for transforming raw images into structured, actionable data.

This tutorial is aimed at computer vision practitioners, data scientists, and developers who are interested in using LLMs for visual tasks. The full code is provided in an easy-to-use Colab notebook to help you follow along step-by-step.

Technology Stack

  1. GPT-4 Vision Model: GPT-4 is a multimodal model developed by OpenAI, capable of understanding both text and images [1]. Trained on vast amounts of multimodal data, it can generalize across a wide variety of tasks in a zero-shot manner, often without the need for fine-tuning. While the exact architecture and size of GPT-4 have not been publicly disclosed, its capabilities are among the most advanced in the field. GPT-4 is available via the OpenAI API on a paid token basis. In this tutorial, we use GPT-4 for its excellent zero-shot performance, but the code allows for easy swapping with other models based on your needs.
  2. Langchain: For building the pipeline, we will use LangChain. LangChain is a powerful framework that simplifies complex workflows, ensures consistency in the code, and makes it easy to switch between LLM models [2]. In our case, Langchain will help us to link the steps of loading images, generating prompts, invoking the GPT model, and parsing the output into structured data.
  3. Pydantic: Pydantic is a powerful library for data validation in Python [3]. We’ll use Pydantic to define the structure of the expected output from the GPT-4 model. This will help us ensure that the output is consistent and easy to work with.

Dataset Overview

To simulate data from a vehicle inspection checkpoint, we’ll use a sample of vehicle images from the ‘Car Number plate’ Kaggle dataset [4]. This dataset is available under the Apache 2.0 License. You can view the images below:

Vehicle images from the ‘Car Number plate’ Kaggle dataset

Let’s Code!

Before diving into the practical implementation, we need to take care of some preparations:

  • Generate an OpenAI API key – The OpenAI API is a paid service. To use the API, you need to sign up for an OpenAI account and generate a secret API key linked to the paid plan (learn more).
  • Configure your OpenAI API key – In Colab, you can securely store your API key as an environment variable (secret), found on the left sidebar (🔑). Create a secret named OPENAI_API_KEY, paste your API key into the value field, and toggle ‘Notebook access’ on.
  • Install and import the required libraries – a minimal setup sketch is shown below.
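A minimal setup sketch, under the assumption that the notebook relies on the langchain, langchain-openai and pydantic packages and that the key is stored as a Colab secret named OPENAI_API_KEY (adjust the imports to match your own environment):

# Run once in Colab to install the assumed dependencies:
# !pip install -q langchain langchain-openai pydantic

import base64
import json
import os

import pandas as pd
from pydantic import BaseModel, Field
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.runnables import chain
from langchain_openai import ChatOpenAI

# Read the API key from Colab's secret storage and expose it as an
# environment variable, so ChatOpenAI can pick it up automatically.
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")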

Pipeline Architecture

In this implementation we will use LangChain’s chain abstraction to link together a sequence of steps in the pipeline. Our pipeline chain is composed of 4 components: an image loading component, a prompt generation component, an MLLM invoking component and a parser component to parse the LLM’s output into structured format. The inputs and outputs for each step in a chain are typically structured as dictionaries, where the keys represent the parameter names, and the values are the actual data. Let’s see how it works.

Image Loading Component

The first step in the chain is loading the image and converting it into base64 encoding, since GPT-4 requires the image to be supplied in a text-based (base64) format.

def image_encoding(inputs):
    """Load and Convert image to base64 encoding"""

    with open(inputs["image_path"], "rb") as image_file:
        image_base64 = base64.b64encode(image_file.read()).decode("utf-8")
    return {"image": image_base64}

The inputs parameter is a dictionary containing the image path, and the output is a dictionary containing the base64-encoded image.
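To get a feel for this component, you could call it directly; the file name below is just a placeholder for one of the dataset images:

# "car_001.jpg" is a hypothetical path - substitute a real image file.
encoded = image_encoding({"image_path": "car_001.jpg"})
print(list(encoded.keys()))    # ['image']
print(encoded["image"][:40])   # start of the base64 string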

Define the output structure with Pydantic

We begin by specifying the required output structure using a class named Vehicle which inherits from Pydantic’s BaseModel. Each field (e.g., Type, License, Make, Model, Color) is defined using Field, which allows us to:

  • Specify the output data type (e.g., str, int, list, etc.).
  • Provide a description of the field for the LLM.
  • Include examples to guide the LLM.

The ... (ellipsis) in each Field indicates that the field is required and cannot be omitted.

Here’s how the class looks:

class Vehicle(BaseModel):

    Type: str = Field(
        ...,
        examples=["Car", "Truck", "Motorcycle", 'Bus'],
        description="Return the type of the vehicle.",
    )

    License: str = Field(
        ...,
        description="Return the license plate number of the vehicle.",
    )

    Make: str = Field(
        ...,
        examples=["Toyota", "Honda", "Ford", "Suzuki"],
        description="Return the Make of the vehicle.",
    )

    Model: str = Field(
        ...,
        examples=["Corolla", "Civic", "F-150"],
        description="Return the Model of the vehicle.",
    )

    Color: str = Field(
        ...,
        examples=["Red", "Blue", "Black", "White"],
        description="Return the color of the vehicle.",
    )
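As a quick sanity check, the schema can be exercised directly. The values below are made up purely for illustration, and model_dump() assumes Pydantic v2 (on v1, use .dict() instead):

# Illustrative values only - not actual output from the dataset.
sample = Vehicle(
    Type="Car",
    License="AB12 CDE",
    Make="Toyota",
    Model="Corolla",
    Color="White",
)
print(sample.model_dump())  # {'Type': 'Car', 'License': 'AB12 CDE', ...}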

Parser Component

To make sure the LLM output matches our expected format, we use the JsonOutputParser initialized with the Vehicle class. This parser validates that the output follows the structure we’ve defined, verifying the fields, types, and constraints. If the output does not match the expected format, the parser will raise a validation error.

The parser.get_format_instructions() method generates a string of instructions based on the schema from the Vehicle class. These instructions will be part of the prompt and will guide the model on how to structure its output so it can be parsed. You can view the instructions variable content in the Colab notebook.

parser = JsonOutputParser(pydantic_object=Vehicle)
instructions = parser.get_format_instructions()
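To see what the parser does, here is a hypothetical raw model response being parsed; the JSON string is made up for illustration:

# A made-up raw response, in the JSON format requested by the instructions.
raw_response = '{"Type": "Car", "License": "AB12 CDE", "Make": "Toyota", "Model": "Corolla", "Color": "White"}'
parsed = parser.parse(raw_response)  # returns a plain Python dict
print(parsed["Make"])                # Toyota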

Prompt Generation Component

The next component in our pipeline is constructing the prompt. The prompt is composed of a system prompt and a human prompt:

  • System prompt: Defined in the SystemMessage, and we use it to establish the AI’s role.
  • Human prompt: Defined in the HumanMessage and consists of three parts: 1) the task description, 2) the format instructions pulled from the parser, and 3) the image in base64 format, together with the image quality detail parameter.

The detail parameter controls how the model processes the image and generates its textual understanding [5]. It has three options: low, high, or auto:

  • low: The model processes a low resolution (512 x 512 px) version of the image, and represents the image with a budget of 85 tokens. This allows the API to return faster responses and consume fewer input tokens.
  • high: The model first analyses a low resolution image (85 tokens) and then creates detailed crops using 170 tokens per 512 x 512 px tile.
  • auto: The default setting, where the low or high setting is chosen automatically based on the image size.

For our setup, low resolution is sufficient, but other applications may benefit from the high-resolution option.

Here’s the implementation for the prompt creation step:

@chain
def prompt(inputs):
    """Create the prompt"""

    prompt = [
    SystemMessage(content="""You are an AI assistant whose job is to inspect an image and provide the desired information from the image. If the desired field is not clear or not well detected, return none for this field. Do not try to guess."""),
    HumanMessage(
        content=[
            {"type": "text", "text": """Examine the main vehicle type, make, model, license plate number and color."""},
            {"type": "text", "text": instructions},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{inputs['image']}", "detail": "low", }}]
        )
    ]
    return prompt

The @chain decorator is used to indicate that this function is part of a LangChain pipeline, where the results of this function can be passed to the next step in the workflow.

MLLM Component

The next step in the pipeline is invoking the MLLM to produce the information from the image, using the MLLM_response function.

First we initialize a multimodal GPT-4 model with ChatOpenAI, with the following configurations:

  • model specifies the exact version of the GPT-4 model.
  • temperature is set to 0.0 to ensure a deterministic response.
  • max_tokens limits the maximum length of the output to 1024 tokens.

Next, we invoke the GPT-4 model using model.invoke with the assembled inputs, which include the image and prompt. The model processes the inputs and returns the information from the image.

@chain
def MLLM_response(inputs):
    """Invoke GPT model to extract information from the image"""

    model: ChatOpenAI = ChatOpenAI(
        model="gpt-4o-2024-08-06",
        temperature=0.0,
        max_tokens=1024,
    )
    output = model.invoke(inputs)
    return output.content

Constructing the Pipeline Chain

After all of the components are defined, we connect them with the | operator to construct the pipeline chain. This operator sequentially links the outputs of one step to the inputs of the next, creating a smooth workflow.
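A minimal sketch of how the chain might be composed from the components defined above (the exact line lives in the Colab notebook, so treat this as an assumption about its shape):

# Compose the four components into a single runnable pipeline.
pipeline = image_encoding | prompt | MLLM_response | parser

Because prompt and MLLM_response are already Runnables (thanks to the @chain decorator), LangChain coerces the plain image_encoding function into the chain automatically.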

Inference on a Single Image

Now comes the fun part! We can extract information from a vehicle image by passing a dictionary containing the image path to the pipeline.invoke method. Here’s how it works:

output = pipeline.invoke({"image_path": f"{img_path}"})

The output is a dictionary with the vehicle details:

Left: The input image. Right: The output dictionary.

For further integration with databases or API responses, we can easily convert the output dictionary to JSON:

json_output = json.dumps(output)

Inference on a Batch of Images

LangChain simplifies batch inference by allowing you to process multiple images simultaneously. To do this, pass a list of dictionaries containing image paths and invoke the pipeline using pipeline.batch:

# Prepare a list of dictionaries with image paths:
batch_input = [{"image_path": path} for path in image_paths]

# Perform batch inference:
output = pipeline.batch(batch_input)

The resulting output dictionary can be easily converted into tabular data, such as a Pandas DataFrame:

df = pd.DataFrame(output)
Left: The output vehicle data as a DataFrame. Right: The input images.

As we can see, the GPT-4 model correctly identified the vehicle type, license plate, make, model and color, providing accurate and structured information. Where the details were not clearly visible, as in the motorcycle image, it returned ‘None’ as instructed in the prompt.

Concluding Remarks

In this tutorial we learned how to extract structured data from images and used it to build a vehicle documentation system. The same principles can be adapted to a wide range of other applications. We utilized the GPT-4 model, which showed strong performance in identifying vehicle details. However, our LangChain-based implementation is flexible, allowing for easy integration with other MLLM models. While we achieved good results, it is important to remain mindful of potential hallucinations, which can arise with LLM-based models.

Practitioners should also consider potential privacy and safety risks when implementing similar systems. Though data in the OpenAI API platform is not used to train models by default [6], handling sensitive data requires adherence to proper regulations.

Full Code as Colab notebook:

Thank you for reading!

Congratulations on making it all the way here. Click 👍 x50 to show your appreciation and raise the algorithm’s self-esteem 🤓

Want to learn more?

References

[1] GPT-4 Technical Report [link]

[2] LangChain [link]

[3] PyDantic [link]

[4] ‘Car Number plate’ Kaggle dataset [link]

[5] OpenAI – Low or high fidelity image understanding [link]

[6] Enterprise privacy at OpenAI [link]

The post Extracting Structured Vehicle Data from Images appeared first on Towards Data Science.

]]>