The Art of Noise

Understanding and implementing a diffusion model from scratch with PyTorch

Introduction

In my last several articles I talked about generative deep learning algorithms, which were mostly related to text generation tasks, so I think it would be interesting to switch to generative algorithms for image generation now. There are plenty of deep learning models specialized for image generation, such as the Autoencoder, Variational Autoencoder (VAE), Generative Adversarial Network (GAN) and Neural Style Transfer (NST). I have written about some of these topics on Medium as well, and I provide the links at the end of this article if you want to read them.

In today’s article, I would like to discuss the so-called diffusion model — one of the most impactful models in the field of deep learning for image generation. The idea of this algorithm was first proposed in the paper titled Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. back in 2015 [1]. Their framework was then developed further by Ho et al. in 2020 in their paper titled Denoising Diffusion Probabilistic Models [2]. DDPM was later adapted by OpenAI and Google to develop DALL·E 2 and Imagen, both of which have impressive capabilities to generate high-quality images.

How Diffusion Model Works

Generally speaking, a diffusion model works by generating an image from noise. We can think of it like an artist transforming a splash of paint on a canvas into a beautiful artwork. In order to do so, the diffusion model needs to be trained first. Training involves two main processes, namely forward diffusion and backward diffusion.

Figure 1. The forward and backward diffusion process [3].

As you can see in the above figure, forward diffusion is a process where Gaussian noise is applied to the original image iteratively. We keep adding noise until the image is completely unrecognizable, at which point we can say that the image now lies in the latent space. Different from Autoencoders and GANs, where the latent space typically has a lower dimension than the original image, the latent space in DDPM maintains the exact same dimensionality as the original image. This noising process follows the principle of a Markov Chain, meaning that the image at timestep t is affected only by the image at timestep t-1. Forward diffusion is considered easy since all we basically do is add some noise step by step.
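To make the Markov-chain intuition more concrete, a single forward step can be thought of as mixing the previous image with fresh Gaussian noise. The snippet below is only a toy sketch that assumes a constant noise level beta for every step; the actual schedule used later in this article varies beta across timesteps.

# A toy sketch of iterative forward diffusion (not part of the original codeblocks).
import torch

def forward_step(x_prev, beta=0.02):
    # One Markov step: the image at timestep t depends only on the image at timestep t-1.
    noise = torch.randn_like(x_prev)
    return (1.0 - beta) ** 0.5 * x_prev + beta ** 0.5 * noise

x = torch.rand(1, 28, 28)   # a dummy "clean" image
for _ in range(1000):       # after enough steps, x is essentially pure Gaussian noise
    x = forward_step(x)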

The second training phase is called backward diffusion, where our objective is to remove the noise little by little until we obtain a clear image. This process follows the principle of the reverse Markov Chain, where the image at timestep t-1 can only be obtained based on the image at timestep t. Such a denoising process is really difficult since we need to guess which pixels are noise and which ones belong to the actual image content. Thus, we need to employ a neural network model to do so.

DDPM uses U-Net as the basis of the deep learning architecture for backward diffusion. However, instead of using the original U-Net model [4], we need to make several modifications to it so that it will be more suitable for our task. Later on, I am going to train this model on the MNIST Handwritten Digit dataset [5], and we will see whether it can generate similar images.

Well, that was pretty much all the fundamental concepts you need to know about diffusion models for now. In the next sections we are going to get even deeper into the details while implementing the algorithm from scratch.


PyTorch Implementation

We are going to start by importing the required modules. In case you’re not yet familiar with the imports below, both torch and torchvision are the libraries we’ll use for preparing the model and the dataset. Meanwhile, matplotlib and tqdm will help us display images and progress bars.

# Codeblock 1
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm

As the modules have been imported, the next thing to do is to initialize some config parameters. Look at the Codeblock 2 below for the details.

# Codeblock 2
IMAGE_SIZE     = 28     #(1)
NUM_CHANNELS   = 1      #(2)

BATCH_SIZE     = 2
NUM_EPOCHS     = 10
LEARNING_RATE  = 0.001

NUM_TIMESTEPS  = 1000   #(3)
BETA_START     = 0.0001 #(4)
BETA_END       = 0.02   #(5)
TIME_EMBED_DIM = 32     #(6)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")  #(7)
DEVICE

# Codeblock 2 Output
device(type='cuda')

At the lines marked with #(1) and #(2) I set IMAGE_SIZE and NUM_CHANNELS to 28 and 1, values that match the image dimensions in the MNIST dataset. The BATCH_SIZE, NUM_EPOCHS, and LEARNING_RATE variables are pretty straightforward, so I don’t think I need to explain them further.

At line #(3), the variable NUM_TIMESTEPS denotes the number of iterations in the forward and backward diffusion process. Timestep 0 is the condition where the image is in its original state (the leftmost image in Figure 1). In this case, since we set this parameter to 1000, timestep number 999 is going to be the condition where the image is completely unrecognizable (the rightmost image in Figure 1). It is important to keep in mind that the choice of the number of timesteps involves a tradeoff between model accuracy and computational cost. If we assign a small value for NUM_TIMESTEPS, the inference time is going to be shorter, yet the resulting image might not be really good since the model has fewer steps to refine the image in the backward diffusion stage. On the other hand, increasing NUM_TIMESTEPS will slow down the inference process, but we can expect the output image to have better quality thanks to the gradual denoising process which results in a more precise reconstruction.

Next, the BETA_START (#(4)) and BETA_END (#(5)) variables are used to control the amount of Gaussian noise added at each timestep, whereas TIME_EMBED_DIM (#(6)) determines the length of the feature vector used to store the timestep information. Lastly, at line #(7) I assign "cuda" to the DEVICE variable if PyTorch detects a GPU installed on our machine. I highly recommend you run this project on a GPU since training a diffusion model is computationally expensive. Note that the values set for NUM_TIMESTEPS, BETA_START and BETA_END are all adopted directly from the DDPM paper [2].

The complete implementation will be done in several steps: constructing the U-Net model, preparing the dataset, defining noise scheduler for the diffusion process, training, and inference. We are going to discuss each of those stages in the following sub-sections.


The U-Net Architecture: Time Embedding

As I’ve mentioned earlier, the basis of a diffusion model is U-Net. This architecture is used because its output layer is suitable for representing an image, which makes sense since it was originally introduced for image segmentation tasks in the first place. The following figure shows what the original U-Net architecture looks like.

Figure 2. The original U-Net model proposed in [4].

However, it is necessary to modify this architecture so that it can also take the timestep information into account. Additionally, since we will only use the MNIST dataset, we also need to make the model smaller. Remember the convention in deep learning that simpler models are often more effective for simple tasks.

In the figure below I show you the entire modified U-Net model. Here you can see that the time embedding tensor is injected into the model at every stage, which will later be done by element-wise summation, allowing the model to capture the timestep information. Next, instead of repeating each of the downsampling and upsampling stages four times like the original U-Net, in this case we will only repeat each of them twice. Additionally, it is worth noting that the stack of downsampling stages is also known as the encoder, whereas the stack of upsampling stages is often called the decoder.

Figure 3. The modified U-Net model for our diffusion task [3].

Now let’s start constructing the architecture by creating a class for generating the time embedding tensor, the idea of which is similar to the positional embedding in the Transformer. See the Codeblock 3 below for the details.

# Codeblock 3
class TimeEmbedding(nn.Module):
    def forward(self):
        time = torch.arange(NUM_TIMESTEPS, device=DEVICE).reshape(NUM_TIMESTEPS, 1)
        print(f"time\t\t: {time.shape}")
          
        i = torch.arange(0, TIME_EMBED_DIM, 2, device=DEVICE)
        denominator = torch.pow(10000, i/TIME_EMBED_DIM)
        print(f"denominator\t: {denominator.shape}")
          
        even_time_embed = torch.sin(time/denominator)  #(1)
        odd_time_embed  = torch.cos(time/denominator)  #(2)
        print(f"even_time_embed\t: {even_time_embed.shape}")
        print(f"odd_time_embed\t: {odd_time_embed.shape}")
          
        stacked = torch.stack([even_time_embed, odd_time_embed], dim=2)  #(3)
        print(f"stacked\t\t: {stacked.shape}")
        time_embed = torch.flatten(stacked, start_dim=1, end_dim=2)  #(4)
        print(f"time_embed\t: {time_embed.shape}")
          
        return time_embed

What we basically do in the above code is create a tensor of size NUM_TIMESTEPS × TIME_EMBED_DIM (1000×32), where every row of this tensor contains the timestep information. Later on, each of the 1000 timesteps will be represented by a feature vector of length 32. The values in the tensor themselves are obtained based on the two equations in Figure 4. In Codeblock 3 above, these two equations are implemented at lines #(1) and #(2), each forming a tensor of size 1000×16. Next, these tensors are combined using the code at lines #(3) and #(4).

Here I also print out every single step done in the above codeblock so that you can get a better understanding of what is actually happening inside the TimeEmbedding class. If you still want more explanation about the above code, feel free to read my previous post about the Transformer, which you can access through the link at the end of this article. Once you click the link, just scroll all the way down to the Positional Encoding section.

Figure 4. The sinusoidal positional encoding formula from the Transformer paper [6].
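Since the figure above is an image, it may help to also write the formulas out. These are the standard sinusoidal encodings from the Transformer paper [6], with the token position replaced by the timestep t and the model dimension by TIME_EMBED_DIM:

PE_{(t,\,2i)} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad PE_{(t,\,2i+1)} = \cos\!\left(\frac{t}{10000^{2i/d}}\right)

where d = TIME_EMBED_DIM (32) and i indexes the feature dimension, which is exactly what the even_time_embed and odd_time_embed tensors compute.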

Now let’s check if the TimeEmbedding class works properly using the following testing code. The resulting output shows that it successfully produced a tensor of size 1000×32, which is exactly what we expected earlier.

# Codeblock 4
time_embed_test = TimeEmbedding()
out_test = time_embed_test()

# Codeblock 4 Output
time            : torch.Size([1000, 1])
denominator     : torch.Size([16])
even_time_embed : torch.Size([1000, 16])
odd_time_embed  : torch.Size([1000, 16])
stacked         : torch.Size([1000, 16, 2])
time_embed      : torch.Size([1000, 32])

The U-Net Architecture: DoubleConv

If you take a closer look at the modified architecture, you will see that it actually contains lots of repeating patterns, such as the ones highlighted in yellow boxes in the following figure.

Figure 5. The processes done inside the yellow boxes will be implemented in the DoubleConv class [3].

These five yellow boxes share the same structure, where they consist of two convolution layers with the time embedding tensor injected right after the first convolution operation is performed. So, what we are going to do now is to create another class named DoubleConv to reproduce this structure. Look at the Codeblock 5a and 5b below to see how I do that.

# Codeblock 5a
class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()
        
        self.conv_0 = nn.Conv2d(in_channels=in_channels,  #(2)
                                out_channels=out_channels, 
                                kernel_size=3, 
                                bias=False, 
                                padding=1)
        self.bn_0 = nn.BatchNorm2d(num_features=out_channels)  #(3)
        
        self.time_embedding = TimeEmbedding()  #(4)
        self.linear = nn.Linear(in_features=TIME_EMBED_DIM,  #(5)
                                out_features=out_channels)
        
        self.conv_1 = nn.Conv2d(in_channels=out_channels,  #(6)
                                out_channels=out_channels, 
                                kernel_size=3, 
                                bias=False, 
                                padding=1)
        self.bn_1 = nn.BatchNorm2d(num_features=out_channels)  #(7)
        
        self.relu = nn.ReLU(inplace=True)  #(8)

The two inputs of the __init__() method above give us the flexibility to configure the number of input and output channels (#(1)), so that the DoubleConv class can be used to instantiate all five yellow boxes simply by adjusting its input arguments. As the name suggests, here we initialize two convolution layers (lines #(2) and #(6)), each followed by a batch normalization layer and a ReLU activation function. Keep in mind that the two normalization layers need to be initialized separately (lines #(3) and #(7)) since each of them has its own trainable normalization parameters. Meanwhile, the ReLU activation function only needs to be initialized once (#(8)) because it contains no parameters, allowing it to be used multiple times in different parts of the network. At line #(4), we initialize the TimeEmbedding layer we created earlier, which will later be connected to a standard linear layer (#(5)). This linear layer is responsible for adjusting the dimension of the time embedding tensor so that the resulting output can be summed with the output from the first convolution layer in an element-wise manner.

Now let’s take a look at Codeblock 5b below to better understand the flow of the DoubleConv block. Here you can see that the forward() method accepts two inputs: the raw image x and the timestep information t, as shown at line #(1). We initially process the image with the first Conv-BN-ReLU sequence (#(2–4)). This Conv-BN-ReLU structure is typically used when working with CNN-based models, even if the illustration does not explicitly show the batch normalization and ReLU layers. Apart from the image, we then take the row of the time embedding tensor that corresponds to the timestep t of each image (#(5)) and pass it through the linear layer (#(6)). We still need to expand the dimension of the resulting tensor using the code at line #(7) before performing the element-wise summation at line #(8). Finally, we process the resulting tensor with the second Conv-BN-ReLU sequence (#(9–11)).

# Codeblock 5b
    def forward(self, x, t):  #(1)
        print(f'images\t\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        
        x = self.conv_0(x)  #(2)
        x = self.bn_0(x)    #(3)
        x = self.relu(x)    #(4)
        print(f'\nafter first conv\t: {x.size()}')
        
        time_embed = self.time_embedding()[t]      #(5)
        print(f'\ntime_embed\t\t: {time_embed.size()}')
        
        time_embed = self.linear(time_embed)       #(6)
        print(f'time_embed after linear\t: {time_embed.size()}')
        
        time_embed = time_embed[:, :, None, None]  #(7)
        print(f'time_embed expanded\t: {time_embed.size()}')
        
        x = x + time_embed  #(8)
        print(f'\nafter summation\t\t: {x.size()}')
        
        x = self.conv_1(x)  #(9)
        x = self.bn_1(x)    #(10)
        x = self.relu(x)    #(11)
        print(f'after second conv\t: {x.size()}')
        
        return x

To see if our DoubleConv implementation works properly, we are going to test it with the Codeblock 6 below. Here I want to simulate the very first instance of this block, which corresponds to the leftmost yellow box in Figure 5. To do so, we need to set the in_channels and out_channels parameters to 1 and 64, respectively (#(1)). Next, we initialize two input tensors, namely x_test and t_test. The x_test tensor has the size of 2×1×28×28, representing a batch of two grayscale images of size 28×28 (#(2)). Keep in mind that this is just a dummy tensor of random values which will be replaced with the actual images from the MNIST dataset later in the training phase. Meanwhile, t_test is a tensor containing the timestep numbers of the corresponding images (#(3)). The values for this tensor are randomly selected between 0 and NUM_TIMESTEPS (1000). Note that the datatype of this tensor must be an integer since the numbers will be used for indexing, as shown at line #(5) back in Codeblock 5b. Lastly, at line #(4) we pass both the x_test and t_test tensors to the double_conv_test layer.

By the way, I re-ran the previous codeblocks with the print() functions removed prior to running the following code so that the outputs look neater.

# Codeblock 6
double_conv_test = DoubleConv(in_channels=1, out_channels=64).to(DEVICE)  #(1)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)  #(2)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)  #(3)

out_test = double_conv_test(x_test, t_test)  #(4)

# Codeblock 6 Output
images                  : torch.Size([2, 1, 28, 28])   #(1)
timesteps               : torch.Size([2]), tensor([468, 304], device='cuda:0')  #(2)

after first conv        : torch.Size([2, 64, 28, 28])  #(3)

time_embed              : torch.Size([2, 32])          #(4)
time_embed after linear : torch.Size([2, 64])
time_embed expanded     : torch.Size([2, 64, 1, 1])    #(5)

after summation         : torch.Size([2, 64, 28, 28])  #(6)
after second conv       : torch.Size([2, 64, 28, 28])  #(7)

The shape of our original input tensors can be seen at lines #(1) and #(2) in the above output. Specifically at line #(2), I also print out the two timesteps that we selected randomly. In this example we assume that each of the two images in the x tensor is already noised with the noise level of the 468-th and 304-th timesteps, respectively, prior to being fed into the network. We can see that the shape of the image tensor x changes to 2×64×28×28 after being passed through the first convolution layer (#(3)). Meanwhile, the size of our time embedding tensor becomes 2×32 (#(4)), which is obtained by extracting rows 468 and 304 from the original embedding of size 1000×32. In order to allow element-wise summation to be performed (#(6)), we need to map the 32-dimensional time embedding vectors into 64 dimensions and expand their axes, resulting in a tensor of size 2×64×1×1 (#(5)) so that it can be broadcast to the 2×64×28×28 tensor. After the summation is done, we then pass the tensor through the second convolution layer, at which point the tensor dimension does not change at all (#(7)).


The U-Net Architecture: Encoder

As we have successfully implemented the DoubleConv block, the next step to do is to implement the so-called DownSample block. In Figure 6 below, this corresponds to the parts enclosed in the red box.

Figure 6. The parts of the network highlighted in red are the so-called DownSample blocks [3].

The purpose of a DownSample block is to reduce the spatial dimension of an image, but it is important to note that at the same time it increases the number of channels. In order to achieve this, we can simply stack a DoubleConv block and a maxpooling operation. In this case the pooling uses a 2×2 kernel with a stride of 2, making the spatial dimension of the output half that of the input. The implementation of this block can be seen in Codeblock 7 below.

# Codeblock 7
class DownSample(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()
        
        self.double_conv = DoubleConv(in_channels=in_channels,  #(2)
                                      out_channels=out_channels)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)    #(3)
    
    def forward(self, x, t):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        
        convolved = self.double_conv(x, t)   #(5)
        print(f'\nafter double conv\t: {convolved.size()}')
        
        maxpooled = self.maxpool(convolved)  #(6)
        print(f'after pooling\t\t: {maxpooled.size()}')
        
        return convolved, maxpooled          #(7)

Here I set the __init__() method to take the number of input and output channels so that we can use it for creating the two DownSample blocks highlighted in Figure 6 without needing to write them in separate classes (#(1)). Next, the DoubleConv and maxpooling layers are initialized at lines #(2) and #(3), respectively. Remember that since the DoubleConv block accepts the image x and the corresponding timestep t as inputs, we also need to set the forward() method of this DownSample block so that it accepts both of them as well (#(4)). The information contained in x and t is then combined as the two tensors are processed by the double_conv layer, whose output is stored in the variable named convolved (#(5)). Afterwards, we perform the actual downsampling with the maxpooling operation at line #(6), producing a tensor named maxpooled. It is important to note that both the convolved and maxpooled tensors are returned, which is essentially done because we will later bring maxpooled to the next downsampling stage, whereas the convolved tensor will be transferred directly to the upsampling stage in the decoder through skip-connections.

Now let’s test the DownSample class using the Codeblock 8 below. The input tensors used here are exactly the same as the ones in Codeblock 6. Based on the resulting output, we can see that the pooling operation successfully converted the output of the DoubleConv block from 2×64×28×28 (#(1)) to 2×64×14×14 (#(2)), indicating that our DownSample class works properly.

# Codeblock 8
down_sample_test = DownSample(in_channels=1, out_channels=64).to(DEVICE)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)

out_test = down_sample_test(x_test, t_test)

# Codeblock 8 Output
original          : torch.Size([2, 1, 28, 28])
timesteps         : torch.Size([2]), tensor([468, 304], device='cuda:0')

after double conv : torch.Size([2, 64, 28, 28])  #(1)
after pooling     : torch.Size([2, 64, 14, 14])  #(2)

The U-Net Architecture: Decoder

We need to introduce the so-called UpSample block in the decoder, which is responsible for reverting the tensor in the intermediate layers to the original image dimension. In order to maintain a symmetrical structure, the number of UpSample blocks must match that of the DownSample blocks. Look at the Figure 7 below to see where the two UpSample blocks are placed.

Figure 7. The components inside the blue boxes are the so-called UpSample blocks [3].

Since both UpSample blocks are structurally identical, we can just initialize a single class for them, just like the DownSample class we created earlier. Look at the Codeblock 9 below to see how I implement it.

# Codeblock 9
class UpSample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        
        self.conv_transpose = nn.ConvTranspose2d(in_channels=in_channels,  #(1)
                                                 out_channels=out_channels, 
                                                 kernel_size=2, stride=2)  #(2)
        self.double_conv = DoubleConv(in_channels=in_channels,  #(3)
                                      out_channels=out_channels)
        
    def forward(self, x, t, connection):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        print(f'connection\t\t: {connection.size()}')
        
        x = self.conv_transpose(x)  #(5)
        print(f'\nafter conv transpose\t: {x.size()}')
        
        x = torch.cat([x, connection], dim=1)  #(6)
        print(f'after concat\t\t: {x.size()}')
        
        x = self.double_conv(x, t)  #(7)
        print(f'after double conv\t: {x.size()}')
        
        return x

In the __init__() method, we use nn.ConvTranspose2d to upsample the spatial dimension (#(1)). Both the kernel size and stride are set to 2 so that the output will be twice as large (#(2)). Next, the DoubleConv block will be employed to reduce the number of channels, while at the same time combining the timestep information from the time embedding tensor (#(3)).

The flow of this UpSample class is a bit more complicated than that of the DownSample class. If we take a closer look at the architecture, we’ll see that we also have a skip-connection coming directly from the encoder. Thus, we need the forward() method to accept another argument in addition to the original image x and the timestep t, namely the residual tensor connection (#(4)). The first thing we do inside this method is to process the original image x with the transpose convolution layer (#(5)). In fact, this layer not only upsamples the spatial size but also reduces the number of channels at the same time. However, the resulting tensor is then directly concatenated with connection in a channel-wise manner (#(6)), which makes it seem like no channel reduction was performed. It is important to know that at this point these two tensors are merely concatenated, meaning that the information from the two is not yet combined. We finally feed the concatenated tensor to the double_conv layer (#(7)), allowing the two sources to share information with each other through the learnable parameters inside the convolution layers.

The Codeblock 10 below shows how I test the UpSample class. The sizes of the tensors to be passed through are set according to the second upsampling block, i.e., the rightmost blue box in Figure 7.

# Codeblock 10
up_sample_test = UpSample(in_channels=128, out_channels=64).to(DEVICE)

x_test = torch.randn((BATCH_SIZE, 128, 14, 14)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)
connection_test = torch.randn((BATCH_SIZE, 64, 28, 28)).to(DEVICE)

out_test = up_sample_test(x_test, t_test, connection_test)

In the resulting output below, if we compare the input tensor (#(1)) with the final tensor shape (#(2)), we can clearly see that the number of channels was successfully reduced from 128 to 64, while at the same time the spatial dimension increased from 14×14 to 28×28. This essentially means that our UpSample class is now ready to be used in the main U-Net architecture.

# Codeblock 10 Output
original             : torch.Size([2, 128, 14, 14])   #(1)
timesteps            : torch.Size([2]), tensor([468, 304], device='cuda:0')
connection           : torch.Size([2, 64, 28, 28])

after conv transpose : torch.Size([2, 64, 28, 28])
after concat         : torch.Size([2, 128, 28, 28])
after double conv    : torch.Size([2, 64, 28, 28])    #(2)

The U-Net Architecture: Putting All Components Together

Once all U-Net components have been created, what we are going to do next is to wrap them together into a single class. Look at the Codeblock 11a and 11b below for the details.

# Codeblock 11a
class UNet(nn.Module):
    def __init__(self):
        super().__init__()
      
        self.downsample_0 = DownSample(in_channels=NUM_CHANNELS,  #(1)
                                       out_channels=64)
        self.downsample_1 = DownSample(in_channels=64,            #(2)
                                       out_channels=128)
      
        self.bottleneck   = DoubleConv(in_channels=128,           #(3)
                                       out_channels=256)
      
        self.upsample_0   = UpSample(in_channels=256,             #(4)
                                     out_channels=128)
        self.upsample_1   = UpSample(in_channels=128,             #(5)
                                     out_channels=64)
      
        self.output = nn.Conv2d(in_channels=64,                   #(6)
                                out_channels=NUM_CHANNELS,
                                kernel_size=1)

You can see in the __init__() method above that we initialize two downsampling (#(1–2)) and two upsampling (#(4–5)) blocks, whose numbers of input and output channels are set according to the architecture shown in the illustration. There are actually two additional components I haven’t explained yet, namely the bottleneck (#(3)) and the output layer (#(6)). The former is essentially just a DoubleConv block which acts as the main connection between the encoder and the decoder. Look at the Figure 8 below to see which components of the network belong to the bottleneck layer. Next, the output layer is a standard convolution layer responsible for turning the 64-channel image produced by the last upsampling stage into a 1-channel one. This operation is done using a kernel of size 1×1, meaning that it combines information across all channels while operating independently at each pixel position.

Figure 8. The bottleneck layer (the lower part of the model) acts as the main bridge between the encoder and the decoder of U-Net [3].

I guess the forward() method of the entire U-Net in the following codeblock is pretty straightforward, as what we essentially do here is pass the tensors from one layer to another — just don’t forget to include the skip connections between the downsampling and upsampling blocks.

# Codeblock 11b
    def forward(self, x, t):  #(1)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
            
        convolved_0, maxpooled_0 = self.downsample_0(x, t)
        print(f'\nmaxpooled_0\t\t: {maxpooled_0.size()}')
            
        convolved_1, maxpooled_1 = self.downsample_1(maxpooled_0, t)
        print(f'maxpooled_1\t\t: {maxpooled_1.size()}')
            
        x = self.bottleneck(maxpooled_1, t)
        print(f'after bottleneck\t: {x.size()}')
    
        upsampled_0 = self.upsample_0(x, t, convolved_1)
        print(f'upsampled_0\t\t: {upsampled_0.size()}')
            
        upsampled_1 = self.upsample_1(upsampled_0, t, convolved_0)
        print(f'upsampled_1\t\t: {upsampled_1.size()}')
            
        x = self.output(upsampled_1)
        print(f'final output\t\t: {x.size()}')
            
        return x

Now let’s see whether we have correctly constructed the U-Net class above by running the following testing code.

# Codeblock 12
unet_test = UNet().to(DEVICE)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)

out_test = unet_test(x_test, t_test)

# Codeblock 12 Output
original         : torch.Size([2, 1, 28, 28])   #(1)
timesteps        : torch.Size([2]), tensor([468, 304], device='cuda:0')

maxpooled_0      : torch.Size([2, 64, 14, 14])  #(2)
maxpooled_1      : torch.Size([2, 128, 7, 7])   #(3)
after bottleneck : torch.Size([2, 256, 7, 7])   #(4)
upsampled_0      : torch.Size([2, 128, 14, 14])
upsampled_1      : torch.Size([2, 64, 28, 28])
final output     : torch.Size([2, 1, 28, 28])   #(5)

We can see in the above output that the two downsampling stages successfully converted the original tensor of size 1×28×28 (#(1)) into 64×14×14 (#(2)) and 128×7×7 (#(3)), respectively. This tensor is then passed through the bottleneck layer, causing its number of channels to expand to 256 without changing the spatial dimension (#(4)). Lastly, we upsample the tensor twice before eventually shrinking the number of channels to 1 (#(5)). Based on this output, it looks like our model is working properly. Thus, it is now ready to be trained for our diffusion task.


Dataset Preparation

As we have successfully created the entire U-Net architecture, the next thing to do is to prepare the MNIST Handwritten Digit dataset. Before actually loading it, we need to define the preprocessing steps using the transforms.Compose() method from Torchvision, as shown at line #(1) in Codeblock 13. There are two things we do here: converting the images into PyTorch tensors, which also scales the pixel values from 0–255 to 0–1 (#(2)), and normalizing them so that the final pixel values range between -1 and 1 (#(3)). Next, we download the dataset using datasets.MNIST() (#(4)). In this case, we are going to take the images from the training data, hence we use train=True (#(5)). Don’t forget to pass the transform variable we initialized earlier to the transform parameter (transform=transform) so that the images are automatically preprocessed as we load them (#(6)). Lastly, we employ DataLoader to load the images from mnist_dataset (#(7)). The arguments I use for the input parameters are intended to randomly pick BATCH_SIZE (2) images from the dataset in each iteration.

# Codeblock 13
transform = transforms.Compose([  #(1)
    transforms.ToTensor(),        #(2)
    transforms.Normalize((0.5,), (0.5,))  #(3)
])

mnist_dataset = datasets.MNIST(   #(4)
    root='./data', 
    train=True,           #(5)
    download=True, 
    transform=transform   #(6)
)

loader = DataLoader(mnist_dataset,  #(7)
                    batch_size=BATCH_SIZE,
                    drop_last=True, 
                    shuffle=True)

In the following codeblock, I try to load a batch of images from the dataset. In every iteration, loader provides both the images and the corresponding labels, hence we need to store them in two separate variables: images and labels.

# Codeblock 14
images, labels = next(iter(loader))

print('images\t\t:', images.shape)
print('labels\t\t:', labels.shape)
print('min value\t:', images.min())
print('max value\t:', images.max())

We can see in the resulting output below that the images tensor has the size of 2×1×28×28 (#(1)), indicating that two grayscale images of size 28×28 have been successfully loaded. Here we can also see that the length of the labels tensor is 2, which matches the number of loaded images (#(2)). Note that in this case the labels are going to be completely ignored. My plan here is simply to have the model generate any digit it has previously seen in the training dataset, without even knowing which digit it actually is. Lastly, this output also shows that the preprocessing works properly, as the pixel values now range between -1 and 1.

# Codeblock 14 Output
images    : torch.Size([2, 1, 28, 28])  #(1)
labels    : torch.Size([2])             #(2)
min value : tensor(-1.)
max value : tensor(1.)

Run the following code if you want to see what the image we just loaded looks like.

# Codeblock 15   
plt.imshow(images[0].squeeze(), cmap='gray')
plt.show()
Figure 9. Output from Codeblock 15 [3].

Noise Scheduler

In this section we are going to talk about how the forward and backward diffusion are performed, a process that essentially involves adding or removing noise little by little at each timestep. It is necessary to know that we want the noise to be spread evenly across all timesteps: in the forward diffusion the image should be completely filled with noise at the final timestep, while in the backward diffusion we have to get a completely clear image back at timestep 0. Hence, we need something to control the noise amount for each timestep. Later in this section, I am going to implement a class named NoiseScheduler to do so. This will probably be the most mathy section of this article, as I’ll display many equations here. But don’t worry, since we’ll focus on implementing these equations rather than discussing their mathematical derivations.

Now let’s take a look at the equations in Figure 10 which I will implement in the __init__() method of the NoiseScheduler class below.

Figure 10. The equations we need to implement in the __init__() method of the NoiseScheduler class [3].
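In text form, the quantities in Figure 10 are the standard DDPM definitions, which the code below implements directly: the betas \beta_t are linearly spaced between BETA_START and BETA_END, and

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \sqrt{\bar{\alpha}_t}, \qquad \sqrt{1 - \bar{\alpha}_t}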
# Codeblock 16a
class NoiseScheduler:
    def __init__(self):
        self.betas = torch.linspace(BETA_START, BETA_END, NUM_TIMESTEPS)  #(1)
        self.alphas = 1. - self.betas
        self.alphas_cum_prod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cum_prod = torch.sqrt(self.alphas_cum_prod)
        self.sqrt_one_minus_alphas_cum_prod = torch.sqrt(1. - self.alphas_cum_prod)

The above code works by creating multiple sequences of numbers, all of which are controlled by BETA_START (0.0001), BETA_END (0.02), and NUM_TIMESTEPS (1000). The first sequence we need to instantiate is the betas tensor itself, which is done using torch.linspace() (#(1)). This generates a 1-dimensional tensor of length 1000 ranging from 0.0001 to 0.02, where every single element corresponds to a single timestep. The interval between elements is uniform, allowing us to add a uniform amount of noise across all timesteps as well. With this betas tensor, we then compute alphas, alphas_cum_prod, sqrt_alphas_cum_prod and sqrt_one_minus_alphas_cum_prod based on the four equations in Figure 10. Later on, these tensors will act as the basis of how the noise is generated or removed during the diffusion process.

Diffusion is normally done in a sequential manner. However, since the accumulated Gaussian noise can itself be expressed as a single Gaussian, the forward diffusion equation can be rewritten in a closed form, allowing us to obtain the noisy image at a specific timestep without having to add noise iteratively from the very beginning. The Figure 11 below shows what the closed form of the forward diffusion looks like, where x₀ represents the original image while epsilon (ϵ) denotes an image made up of random Gaussian noise. We can think of this equation as a weighted combination: the clear image and the noise are mixed according to weights determined by the timestep, resulting in an image with a specific amount of noise.

Figure 11. The closed form of the forward diffusion process [3].
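Written out, the closed form in Figure 11 is the standard DDPM expression

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

where \sqrt{\bar{\alpha}_t} and \sqrt{1 - \bar{\alpha}_t} act as the weights of the clean image and the noise, respectively.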

The implementation of this equation can be seen in Codeblock 16b. In this forward_diffusion() method, x₀ and ϵ are denoted as original and noise. Here you need to keep in mind that these two input variables are images, whereas sqrt_alphas_cum_prod_t and sqrt_one_minus_alphas_cum_prod_t are scalars. Thus, we need to adjust the shape of these two scalars (#(1) and #(2)) so that the operation at line #(3) can be performed. The noisy_image variable is going to be the output of this function, whose name I guess is self-explanatory.

# Codeblock 16b
    def forward_diffusion(self, original, noise, t):
        sqrt_alphas_cum_prod_t = self.sqrt_alphas_cum_prod[t]
        sqrt_alphas_cum_prod_t = sqrt_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1)  #(1)
        
        sqrt_one_minus_alphas_cum_prod_t = self.sqrt_one_minus_alphas_cum_prod[t]
        sqrt_one_minus_alphas_cum_prod_t = sqrt_one_minus_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1)  #(2)
        
        noisy_image = sqrt_alphas_cum_prod_t * original + sqrt_one_minus_alphas_cum_prod_t * noise  #(3)
        
        return noisy_image

Now let’s talk about backward diffusion. In fact, this one is a bit more complicated than the forward diffusion since we need three more equations here. Before I give you these equations, let me show you the implementation first. See the Codeblock 16c below.

# Codeblock 16c
    def backward_diffusion(self, current_image, predicted_noise, t):  #(1)
        denoised_image = (current_image - (self.sqrt_one_minus_alphas_cum_prod[t] * predicted_noise)) / self.sqrt_alphas_cum_prod[t]  #(2)
        denoised_image = 2 * (denoised_image - denoised_image.min()) / (denoised_image.max() - denoised_image.min()) - 1  #(3)
        
        current_prediction = current_image - ((self.betas[t] * predicted_noise) / (self.sqrt_one_minus_alphas_cum_prod[t]))  #(4)
        current_prediction = current_prediction / torch.sqrt(self.alphas[t])  #(5)
        
        if t == 0:  #(6)
            return current_prediction, denoised_image
        
        else:
            variance = (1 - self.alphas_cum_prod[t-1]) / (1. - self.alphas_cum_prod[t])  #(7)
            variance = variance * self.betas[t]  #(8)
            sigma = variance ** 0.5
            z = torch.randn(current_image.shape).to(DEVICE)
            current_prediction = current_prediction + sigma*z
            
            return current_prediction, denoised_image

Later in the inference phase, the backward_diffusion() method will be called inside a loop that iterates NUM_TIMESTEPS (1000) times, starting from t = 999, continuing with t = 998, and so on all the way to t = 0. This function is responsible for removing the noise from the image iteratively, based on the current_image (the image produced by the previous denoising step), the predicted_noise (the noise predicted by the U-Net in the previous step), and the timestep information t (#(1)). In each iteration, noise removal is done using the equation shown in Figure 12, which corresponds to lines #(4-5) in Codeblock 16c.

Figure 12. The equation used for removing noise from the image [3].

As long as we haven’t reached t = 0, we compute the variance based on the equation in Figure 13 (#(7–8)). This variance is then used to introduce another controlled noise to simulate the stochasticity in the backward diffusion process, since the noise removal equation in Figure 12 is a deterministic approximation. This is also the reason we don’t calculate the variance once we reach t = 0 (#(6)), since we no longer need to add more noise as the image is completely clear already.

Figure 13. The equation used to calculate variance for introducing controlled noise [3].

Different from current_prediction, which aims to estimate the image of the previous timestep (xₜ₋₁), the objective of the denoised_image tensor is to reconstruct the original image (x₀). Because of these different objectives, we need a separate equation to compute denoised_image, which can be seen in Figure 14 below. The implementation of the equation itself is written at lines #(2–3).

Figure 14. The equation for reconstructing the original image [3].
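For readers who cannot view the figures, the three equations from Figures 12–14 are the standard DDPM sampling formulas, which match lines #(4–5), #(7–8), and #(2) of Codeblock 16c, respectively:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z

\sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t, \qquad z \sim \mathcal{N}(0, I)

\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}

Here \epsilon_\theta is the noise predicted by the U-Net, and the \sigma_t z term is omitted at t = 0.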

Now let’s test the NoiseScheduler class we created above. In the following codeblock, I instantiate a NoiseScheduler object and print out the attributes associated with it, which are all computed using the equations in Figure 10 based on the values stored in the betas attribute. Remember that the actual length of these tensors is NUM_TIMESTEPS (1000), but here I only print out the first 6 elements.

# Codeblock 17
noise_scheduler = NoiseScheduler()

print(f'betas\t\t\t\t: {noise_scheduler.betas[:6]}')
print(f'alphas\t\t\t\t: {noise_scheduler.alphas[:6]}')
print(f'alphas_cum_prod\t\t\t: {noise_scheduler.alphas_cum_prod[:6]}')
print(f'sqrt_alphas_cum_prod\t\t: {noise_scheduler.sqrt_alphas_cum_prod[:6]}')
print(f'sqrt_one_minus_alphas_cum_prod\t: {noise_scheduler.sqrt_one_minus_alphas_cum_prod[:6]}')

# Codeblock 17 Output
betas                          : tensor([1.0000e-04, 1.1992e-04, 1.3984e-04, 1.5976e-04, 1.7968e-04, 1.9960e-04])
alphas                         : tensor([0.9999, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998])
alphas_cum_prod                : tensor([0.9999, 0.9998, 0.9996, 0.9995, 0.9993, 0.9991])
sqrt_alphas_cum_prod           : tensor([0.9999, 0.9999, 0.9998, 0.9997, 0.9997, 0.9996])
sqrt_one_minus_alphas_cum_prod : tensor([0.0100, 0.0148, 0.0190, 0.0228, 0.0264, 0.0300])

The above output indicates that our __init__() method works as expected. Next, we are going to test the forward_diffusion() method. If you go back to Codeblock 16b, you will see that forward_diffusion() accepts three inputs: the original image, the noise image and the timestep number. Let’s just use the image from the MNIST dataset we loaded earlier for the first input (#(1)) and a random Gaussian noise of the exact same size for the second one (#(2)). Run the Codeblock 18 below to see what these two images look like.

# Codeblock 18
image = images[0]  #(1)
noise = torch.randn_like(image)  #(2)

plt.imshow(image.squeeze(), cmap='gray')
plt.show()
plt.imshow(noise.squeeze(), cmap='gray')
plt.show()
Figure 15. The two images to be used as the original (left) and the noise image (right). The one on the left is the same image I showed earlier in Figure 9 [3].

Now that the image and the noise are ready, we just need to pass them to the forward_diffusion() method alongside the t. I actually ran the Codeblock 19 below multiple times with t = 50, 100, 150, and so on up to t = 300. You can see in Figure 16 that the image becomes less clear as the parameter increases. In this case, the image is going to be completely filled with noise when t is set to 999.

# Codeblock 19
noisy_image_test = noise_scheduler.forward_diffusion(image.to(DEVICE), noise.to(DEVICE), t=50)

plt.imshow(noisy_image_test[0].squeeze().cpu(), cmap='gray')
plt.show()
Figure 16. The result of the forward diffusion process at t=50, 100, 150, and so on until t=300 [3].

Unfortunately, we cannot test the backward_diffusion() method since this process requires us to have our U-Net model trained. So, let’s just skip this part for now. I’ll show you how we can actually use this function later in the inference phase.


Training

Now that the U-Net model, the MNIST dataset, and the noise scheduler are ready, we can prepare a function for training. Before we do that, I instantiate the model and the noise scheduler in Codeblock 20 below.

# Codeblock 20
model = UNet().to(DEVICE)
noise_scheduler = NoiseScheduler()

The entire training procedure is implemented in the train() function shown in Codeblock 21. Before doing anything else, we first initialize the optimizer and the loss function, for which we use Adam and MSE, respectively (#(1–2)). We want to train the model such that it is able to predict the noise contained in the input image; later on, this predicted noise will be used as the basis of the denoising process in the backward diffusion stage. To actually train the model, we first need to perform forward diffusion using the code at line #(6). This noising process is applied to the images tensor (#(3)) using the random noise generated at line #(4). Next, we take a random number between 0 and NUM_TIMESTEPS (1000) for t (#(5)), which is essentially done because we want our model to see images of varying noise levels as an approach to improve generalization. Once the noisy images have been generated, we pass them through the U-Net model alongside the chosen t (#(7)). The input t here is useful for the model as it indicates the current noise level in the image. Lastly, the loss function we initialized earlier is responsible for computing the difference between the actual noise and the noise predicted by the model (#(8)). So, the objective of this training is basically to make the predicted noise as similar as possible to the noise we generated at line #(4).

# Codeblock 21
def train():
    optimizer = Adam(model.parameters(), lr=LEARNING_RATE)  #(1)
    loss_function = nn.MSELoss()  #(2)
    losses = []
    
    for epoch in range(NUM_EPOCHS):
        print(f'Epoch no {epoch}')
        
        for images, _ in tqdm(loader):
            
            optimizer.zero_grad()

            images = images.float().to(DEVICE)  #(3)
            noise = torch.randn_like(images)  #(4)
            t = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,))  #(5)

            noisy_images = noise_scheduler.forward_diffusion(images, noise, t).to(DEVICE)  #(6)
            predicted_noise = model(noisy_images, t)  #(7)
            loss = loss_function(predicted_noise, noise)  #(8)
            
            losses.append(loss.item())
            loss.backward()
            optimizer.step()

    return losses

Now let’s run the above training function using the codeblock below. Sit back and relax while waiting for the training to complete. In my case, I used a Kaggle Notebook with an Nvidia P100 GPU turned on, and it took around 45 minutes to finish.

# Codeblock 22
losses = train()

If we take a look at the loss graph, it seems like our model learned pretty well as the value is generally decreasing over time with a rapid drop at early stages and a more stable (yet still decreasing) trend in the later stages. So, I think we can expect good results later in the inference phase.

# Codeblock 23
plt.plot(losses)
Figure 17. How the loss value decreases as the training goes [3].

Inference

At this point we have already got our model trained, so we can now perform inference on it. Look at the Codeblock 24 below to see how I implement the inference() function.

# Codeblock 24
def inference():

    denoised_images = []  #(1)
    
    with torch.no_grad():  #(2)
        current_prediction = torch.randn((64, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)  #(3)
        
        for i in tqdm(reversed(range(NUM_TIMESTEPS))):  #(4)
            predicted_noise = model(current_prediction, torch.as_tensor(i).unsqueeze(0))  #(5)
            current_prediction, denoised_image = noise_scheduler.backward_diffusion(current_prediction, predicted_noise, torch.as_tensor(i))  #(6)

            if i%100 == 0:  #(7)
                denoised_images.append(denoised_image)
            
        return denoised_images

At the line marked with #(1) I initialize an empty list which will be used to store the denoising result every 100 timesteps (#(7)). This will later allow us to see how the backward diffusion progresses. The actual inference process is encapsulated inside torch.no_grad() (#(2)). Remember that in diffusion models we generate images from completely random noise, which we assume corresponds to images at t = 999. To implement this, we can simply use torch.randn() as shown at line #(3). Here we initialize a tensor of size 64×1×28×28, indicating that we are about to generate 64 images simultaneously. Next, we write a for loop that iterates backwards from 999 to 0 (#(4)). Inside this loop, we feed the current image and the timestep as the input to the trained U-Net and let it predict the noise (#(5)). The actual backward diffusion is then performed at line #(6). At the end of the iteration, we should get new images similar to the ones we have in our dataset. Now let’s call the inference() function in the following codeblock.

# Codeblock 25
denoised_images = inference()

Now that the inference is complete, we can see what the resulting images look like. The Codeblock 26 below is used to display the first 42 images we just generated.

# Codeblock 26
fig, axes = plt.subplots(ncols=7, nrows=6, figsize=(10, 8))

counter = 0

for i in range(6):
    for j in range(7):
        axes[i,j].imshow(denoised_images[-1][counter].squeeze().detach().cpu().numpy(), cmap='gray')  #(1)
        axes[i,j].get_xaxis().set_visible(False)
        axes[i,j].get_yaxis().set_visible(False)
        counter += 1

plt.show()
Figure 18. The images generated by the diffusion model trained on the MNIST Handwritten Digit dataset [3].

If you take a look at the above codeblock, you can see that the [-1] index at line #(1) indicates that we only display the images from the last iteration (which corresponds to timestep 0). This is why the images you see in Figure 18 are all free from noise. I do acknowledge that this might not be the best result, since not all of the generated images are valid digits. But on the other hand, this also indicates that these images are not merely duplicates of the original dataset.

Here we can also visualize the backward diffusion process using the Codeblock 27 below. You can see in the resulting output in Figure 19 that we initially start from completely random noise, which gradually disappears as we move to the right.

# Codeblock 27
fig, axes = plt.subplots(ncols=10, figsize=(24, 8))

sample_no = 0
timestep_no = 0

for i in range(10):
    axes[i].imshow(denoised_images[timestep_no][sample_no].squeeze().detach().cpu().numpy(), cmap='gray')
    axes[i].get_xaxis().set_visible(False)
    axes[i].get_yaxis().set_visible(False)
    timestep_no += 1

plt.show()
Figure 19. What the image looks like at timestep 900, 800, 700 and so on until timestep 0 [3].

Ending

There are plenty of directions you can go from here. First, you might need to tweak the parameter configuration in Codeblock 2 if you want better results. Second, it is also possible to modify the U-Net model by implementing attention layers in addition to the stacks of convolution layers we used in the downsampling and upsampling stages. This does not guarantee better results, especially for a simple dataset like this, but it’s definitely worth trying. Third, you can also try to use a more complex dataset if you want to challenge yourself.

When it comes to practical applications, there are actually lots of things you can do with diffusion models. The simplest one might be data augmentation. With a diffusion model, we can easily generate new images from a specific data distribution. For example, suppose we are working on an image classification project, but the number of images across the classes is imbalanced. To address this problem, we can take the images from the minority class and feed them into a diffusion model. By doing so, we can ask the trained diffusion model to generate as many samples from that class as we want, as sketched below.
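As a rough sketch of that workflow, and reusing the UNet, NoiseScheduler, train(), and inference() functions defined earlier in this article, it could look something like the code below. Keep in mind that minority_dataset is a hypothetical placeholder for a dataset object containing only the under-represented class, so treat this as an illustration of the idea rather than a complete solution.

# Hypothetical augmentation sketch; minority_dataset is a placeholder, not defined in this article.
minority_loader = DataLoader(minority_dataset,
                             batch_size=BATCH_SIZE,
                             drop_last=True,
                             shuffle=True)

model = UNet().to(DEVICE)             # a fresh diffusion model for this class only
noise_scheduler = NoiseScheduler()

loader = minority_loader              # train() reads the global variable named loader
losses = train()                      # fit the model on minority-class images only

generated = inference()               # each call produces 64 synthetic samples
new_minority_samples = generated[-1]  # keep the fully denoised images (timestep 0)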

And well, that’s pretty much everything about the theory and the implementation of the diffusion model. Thanks for reading, and I hope you learned something new today!

You can access the code used in this project through this link. Here are also the links to my previous articles about Autoencoder, Variational Autoencoder (VAE), Neural Style Transfer (NST), and Transformer.


References

[1] Jascha Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Arxiv. https://arxiv.org/pdf/1503.03585 [Accessed December 27, 2024].

[2] Jonathan Ho et al. Denoising Diffusion Probabilistic Models. Arxiv. https://arxiv.org/pdf/2006.11239 [Accessed December 27, 2024].

[3] Image created originally by author.

[4] Olaf Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. Arxiv. https://arxiv.org/pdf/1505.04597 [Accessed December 27, 2024].

[5] Yann LeCun et al. The MNIST Database of Handwritten Digits. https://yann.lecun.com/exdb/mnist/ [Accessed December 30, 2024] (Creative Commons Attribution-Share Alike 3.0 license).

[6] Ashish Vaswani et al. Attention Is All You Need. Arxiv. https://arxiv.org/pdf/1706.03762 [Accessed September 29, 2024].

Show and Tell

Implementing one of the earliest neural image caption generator models with PyTorch.

Photo by Ståle Grut on Unsplash

Introduction

Natural Language Processing and Computer Vision used to be two completely different fields. Well, at least back when I started to learn machine learning and deep learning, I felt like there were multiple paths to follow, and each of them, including NLP and Computer Vision, led me to a completely different world. Over time, however, AI has become more and more advanced, and intersections between fields of study have become more common, including between the two I just mentioned.

Today, many language models have the capability to generate images based on a given prompt. That’s one example of the bridge between NLP and Computer Vision. But I guess I’ll save it for an upcoming article as it is a bit more complex. Instead, in this article I am going to discuss the simpler one: image captioning. As the name suggests, this is essentially a technique where a specific model accepts an image and returns a text that describes the input image.

One of the earliest papers on this topic is the one titled "Show and Tell: A Neural Image Caption Generator" written by Vinyals et al. back in 2015 [1]. In this article, I will focus on implementing the Deep Learning model proposed in the paper using PyTorch. Note that I won't actually demonstrate the training process here as that's a topic on its own. Let me know in the comments if you want a separate tutorial on that.


Image Captioning Framework

Generally speaking, image captioning can be done by combining two types of models: the one specialized to process images and another one capable of processing sequences. I believe you already know what kind of models work best for the two tasks – yes, you’re right, those are CNN and RNN, respectively. The idea here is that the CNN is utilized to encode the input image (hence this part is called encoder), whereas the RNN is used for generating a sequence of words based on the features encoded by the CNN (hence the RNN part is called decoder).

It is discussed in the paper that the authors attempted to do so using GoogLeNet (a.k.a., Inception V1) for the encoder and LSTM for the decoder. In fact, the use of GoogLeNet is not explicitly mentioned, yet based on the illustration provided in the paper it seems like the architecture used in the encoder is adopted from the original GoogLeNet paper [2]. The figure below shows what the proposed architecture looks like.

Figure 1. The image captioning model proposed in [1], where the encoder part (the leftmost block) implements the GoogLeNet model [2].

Talking more specifically about the connection between the encoder and the decoder, there are several methods available for connecting the two, namely init-inject, pre-inject, par-inject and merge, as mentioned in [3]. In the case of the Show and Tell paper, authors used pre-inject, a method where the features extracted by the encoder are perceived as the 0th word in the caption. Later in the inference phase, we expect the decoder to generate a caption based solely on these image features.

Figure 2. The four methods possible to be used to connect the encoder and the decoder part of an image captioning model [3]. In our case we are going to use the pre-inject method (b).

As we already understood the theory behind the image captioning model, we can now jump into the code!


Implementation

I’ll break the implementation part into three sections: the Encoder, the Decoder, and the combination of the two. Before we actually get into them, we need to import the modules and initialize the required parameters in advance. Look at the Codeblock 1 below to see the modules I use.

# Codeblock 1
import torch  #(1)
import torch.nn as nn  #(2)
import torchvision.models as models  #(3)
from torchvision.models import GoogLeNet_Weights  #(4)

Let’s break down these imports quickly: the line marked with #(1) is used for basic operations, line #(2) is for initializing neural network layers, line #(3) is for loading various deep learning models, and #(4) is the pretrained weights for the GoogLeNet model.

Talking about the parameter configuration, EMBED_DIM and LSTM_HIDDEN_DIM are the only two parameters mentioned in the paper, which are both set to 512 as shown at line #(1) and #(2) in the Codeblock 2 below. The EMBED_DIM variable essentially indicates the feature vector size representing a single token in the caption. In this case, we can simply think of a single token as an individual word. Meanwhile, LSTM_HIDDEN_DIM is a variable representing the hidden state size inside the LSTM cell. This paper does not mention how many times this RNN-based layer is repeated, but based on the diagram in Figure 1, it seems like it only implements a single LSTM cell. Thus, at line #(3) I set the NUM_LSTM_LAYERS variable to 1.

# Codeblock 2
EMBED_DIM       = 512    #(1)
LSTM_HIDDEN_DIM = 512    #(2)
NUM_LSTM_LAYERS = 1      #(3)

IMAGE_SIZE      = 224    #(4)
IN_CHANNELS     = 3      #(5)

SEQ_LENGTH      = 30     #(6)
VOCAB_SIZE      = 10000  #(7)

BATCH_SIZE      = 1

The next two parameters are related to the input image, namely IMAGE_SIZE (#(4)) and IN_CHANNELS (#(5)). Since we are about to use GoogLeNet for the encoder, we need to match it with its original input shape (3×224×224). Not only for the image, but we also need to configure the parameters for the caption. Here we assume that the caption length is no more than 30 words (#(6)) and the number of unique words in the dictionary is 10000 (#(7)). Lastly, the BATCH_SIZE parameter is used because by default PyTorch processes tensors in a batch. Just to make things simple, the number of image-caption pairs within a single batch is set to 1.

GoogLeNet Encoder

It is actually possible to use any kind of CNN-based model for the encoder. I found on the internet that [4] uses DenseNet, [5] uses Inception V3, and [6] utilizes ResNet for similar tasks. However, since my goal is to reproduce the model proposed in the paper as closely as possible, I am using the pretrained GoogLeNet model instead. Before we get into the encoder implementation, let's see what the GoogLeNet architecture looks like using the following code.

# Codeblock 3
models.googlenet()

The resulting output is very long as it lists literally all layers inside the architecture. Here I truncate the output since I only want you to focus on the last layer (the fc layer marked with #(1) in the Codeblock 3 Output below). You can see that this linear layer maps a feature vector of size 1024 into 1000. Normally, in a standard image classification task, each of these 1000 neurons corresponds to a specific class. So, for example, if you want to perform a 5-class classification task, you would need to modify this layer such that it projects the outputs to 5 neurons only. In our case, we need to make this layer produce a feature vector of length 512 (EMBED_DIM). With this, the input image will later be represented as a 512-dimensional vector after being processed by the GoogLeNet model. This feature vector size will exactly match with the token embedding dimension, allowing it to be treated as a part of our word sequence.

# Codeblock 3 Output
GoogLeNet(
  (conv1): BasicConv2d(
    (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (maxpool1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)
  (conv2): BasicConv2d(
    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )

  .
  .
  .
  .

  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=1000, bias=True)  #(1)
)

Now let’s actually load and modify the GoogLeNet model, which I do in the InceptionEncoder class below.

# Codeblock 4a
class InceptionEncoder(nn.Module):
    def __init__(self, fine_tune):  #(1)
        super().__init__()
        self.googlenet = models.googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)  #(2)
        self.googlenet.fc = nn.Linear(in_features=self.googlenet.fc.in_features,  #(3)
                                      out_features=EMBED_DIM)  #(4)

        if fine_tune == True:       #(5)
            for param in self.googlenet.parameters():
                param.requires_grad = True
        else:
            for param in self.googlenet.parameters():
                param.requires_grad = False

        for param in self.googlenet.fc.parameters():
            param.requires_grad = True

The first thing we do in the above code is to load the model using models.googlenet(). It is mentioned in the paper that the model is already pretrained on the ImageNet dataset. Thus, we need to pass GoogLeNet_Weights.IMAGENET1K_V1 into the weights parameter, as shown at line #(2) in Codeblock 4a. Next, at line #(3) we access the classification head through the fc attribute, where we replace the existing linear layer with a new one having the output dimension of 512 (EMBED_DIM) (#(4)). Since this GoogLeNet model is already trained, we don’t need to train it from scratch. Instead, we can either perform fine-tuning or transfer learning in order to adapt it to the image captioning task.

In case you’re not yet familiar with the two terms, fine-tuning is a method where we update the weights of the entire model. On the other hand, transfer learning is a technique where we only update the weights of the layers we replaced (in this case it’s the last fully-connected layer), while setting the weights of the existing layers non-trainable. To do so, I implement a flag named fine_tune at line #(1) which will let the model to perform fine-tuning whenever it is set to True (#(5)).

The forward() method is pretty straightforward since what we do here is simply passing the input image through the modified GoogLeNet model. See the Codeblock 4b below for the details. Additionally, here I also print out the tensor dimension before and after processing so that you can better understand how the InceptionEncoder model works.

# Codeblock 4b
    def forward(self, images):
        print(f'original\t: {images.size()}')
        features = self.googlenet(images)
        print(f'after googlenet\t: {features.size()}')

        return features

To test whether our encoder works properly, we can pass a dummy tensor of size 1×3×224×224 through the network as demonstrated in Codeblock 5. This tensor dimension simulates a single RGB image of size 224×224. You can see in the resulting output that our image now becomes a single-dimensional feature vector with the length of 512.

# Codeblock 5
inception_encoder = InceptionEncoder(fine_tune=True)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = inception_encoder(images)
# Codeblock 5 Output
original         : torch.Size([1, 3, 224, 224])
after googlenet  : torch.Size([1, 512])

LSTM Decoder

As we have successfully implemented the encoder, we are now going to create the LSTM decoder, which I demonstrate in Codeblocks 6a and 6b. What we need to do first is to initialize the required layers, namely an embedding layer (#(1)), the LSTM layer itself (#(2)), and a standard linear layer (#(3)). The first one (nn.Embedding) is responsible for mapping every single token into a 512 (EMBED_DIM)-dimensional vector. Meanwhile, the LSTM layer is going to generate a sequence of embedded tokens, where each of these tokens will be mapped into a 10000 (VOCAB_SIZE)-dimensional vector by the linear layer. Later on, the values contained in this vector will represent the likelihood of each word in the dictionary being chosen.

# Codeblock 6a
class LSTMDecoder(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)
        #(2)
        self.lstm = nn.LSTM(input_size=EMBED_DIM, 
                            hidden_size=LSTM_HIDDEN_DIM, 
                            num_layers=NUM_LSTM_LAYERS, 
                            batch_first=True)
        #(3)        
        self.linear = nn.Linear(in_features=LSTM_HIDDEN_DIM, 
                                out_features=VOCAB_SIZE)

Next, let’s define the flow of the network using the following code.

# Codeblock 6b
    def forward(self, features, captions):                 #(1)
        print(f'features original\t: {features.size()}')
        features = features.unsqueeze(1)                   #(2)
        print(f"after unsqueeze\t\t: {features.shape}")

        print(f'captions original\t: {captions.size()}')
        captions = self.embedding(captions)                #(3)
        print(f"after embedding\t\t: {captions.shape}")

        captions = torch.cat([features, captions], dim=1)  #(4)
        print(f"after concat\t\t: {captions.shape}")

        captions, _ = self.lstm(captions)                  #(5)
        print(f"after lstm\t\t: {captions.shape}")

        captions = self.linear(captions)                   #(6)
        print(f"after linear\t\t: {captions.shape}")

        return captions

You can see in the above code that the forward() method of the LSTMDecoder class accepts two inputs: features and captions, where the former is the image that has been processed by the InceptionEncoder, while the latter is the caption of the corresponding image serving as the ground truth (#(1)). The idea here is that we are going to perform pre-inject operation by prepending the features tensor into captions using the code at line #(4). However, keep in mind that we need to adjust the shape of both tensors beforehand. To do so, we have to insert a single dimension at the 1st axis of the image features (#(2)). Meanwhile, the shape of the captions tensor will align with our requirement right after being processed by the embedding layer (#(3)). As the features and captions have been concatenated, we then pass this tensor through the LSTM layer (#(5)) before it is eventually processed by the linear layer (#(6)). Look at the testing code below to better understand the flow of the two tensors.

# Codeblock 7
lstm_decoder = LSTMDecoder()

features = torch.randn(BATCH_SIZE, EMBED_DIM)  #(1)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(2)

captions = lstm_decoder(features, captions)

In Codeblock 7, I assume that features is a dummy tensor that represents the output of the InceptionEncoder model (#(1)). Meanwhile, captions is the tensor representing a sequence of tokenized words, where in this case I initialize it as random numbers ranging between 0 and 10000 (VOCAB_SIZE) with the length of 30 (SEQ_LENGTH) (#(2)).

We can see in the output below that the features tensor initially has the dimension of 1×512 (#(1)). This tensor shape changes to 1×1×512 after being processed with the unsqueeze() operation (#(2)). The additional dimension in the middle (1) allows the tensor to be treated as a feature vector corresponding to a single timestep, which is necessary for compatibility with the LSTM layer. As for the captions tensor, its shape changes from 1×30 (#(3)) to 1×30×512 (#(4)), indicating that every single word is now represented as a 512-dimensional vector.

# Codeblock 7 Output
features original : torch.Size([1, 512])       #(1)
after unsqueeze   : torch.Size([1, 1, 512])    #(2)
captions original : torch.Size([1, 30])        #(3)
after embedding   : torch.Size([1, 30, 512])   #(4)
after concat      : torch.Size([1, 31, 512])   #(5)
after lstm        : torch.Size([1, 31, 512])   #(6)
after linear      : torch.Size([1, 31, 10000]) #(7)

After the pre-inject operation is performed, our tensor now has the dimension of 1×31×512, where the features tensor becomes the token at the 0th timestep in the sequence (#(5)). See the following figure to better illustrate this idea.

Figure 3. What the resulting tensor looks like after the pre-injection operation. [3].

Next, we pass the tensor through the LSTM layer, where in this particular case the output tensor dimension remains the same. However, it is important to note that the tensor shapes at line #(5) and #(6) in the above output are actually specified by different parameters. The dimensions appear to match here because EMBED_DIM and LSTM_HIDDEN_DIM were both set to 512. Normally, if we use a different value for LSTM_HIDDEN_DIM, then the output dimension is going to be different as well. Finally, we project each of the 31 token embeddings to a vector of size 10000, which will later contain the probability of every possible token being predicted (#(7)).

GoogLeNet Encoder + LSTM Decoder

At this point, we have successfully created both the encoder and the decoder parts of the image captioning model. What I am going to do next is to combine them together in the ShowAndTell class below.

# Codeblock 8a
class ShowAndTell(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = InceptionEncoder(fine_tune=True)  #(1)
        self.decoder = LSTMDecoder()     #(2)

    def forward(self, images, captions):
        features = self.encoder(images)  #(3)
        print(f"after encodert: {features.shape}")

        captions = self.decoder(features, captions)      #(4)
        print(f"after decodert: {captions.shape}")

        return captions

I think the above code is pretty straightforward. In the __init__() method, we only need to initialize the InceptionEncoder as well as the LSTMDecoder models (#(1) and #(2)). Here I assume that we are about to perform fine-tuning rather than transfer learning, so I set the fine_tune parameter to True. Theoretically speaking, fine-tuning is better than transfer learning if you have a relatively large dataset since it works by re-adjusting the weights of the entire model. However, if your dataset is rather small, you should go with transfer learning instead – but that’s just the theory. It’s definitely a good idea to experiment with both options to see which works best in your case.

Still with the above codeblock, we configure the forward() method to accept image-caption pairs as input. With this configuration, we basically design this method such that it can only be used for training purposes. Here we initially process the raw image with the GoogLeNet inside the encoder block (#(3)). Afterwards, we pass the extracted features as well as the tokenized captions into the decoder block and let it produce another token sequence (#(4)). In the actual training, this caption output will then be compared with the ground truth to compute the error. This error value is going to be used to compute gradients through backpropagation, which determines how the weights in the network are updated.

It is important to know that we cannot use the forward() method to perform inference, so we need a separate one for that. In this case, I am going to implement the code specifically to perform inference in the generate() method below.

# Codeblock 8b
    def generate(self, images):  #(1)
        features = self.encoder(images)              #(2)
        print(f"after encodertt: {features.shape}n")

        words = []  #(3)
        for i in range(SEQ_LENGTH):                  #(4)
            print(f"iteration #{i}")
            features = features.unsqueeze(1)
            print(f"after unsqueezett: {features.shape}")

            features, _ = self.decoder.lstm(features)
            print(f"after lstmtt: {features.shape}")

            features = features.squeeze(1)           #(5)
            print(f"after squeezett: {features.shape}")

            probs = self.decoder.linear(features)    #(6)
            print(f"after lineartt: {probs.shape}")

            _, word = probs.max(dim=1)  #(7)
            print(f"after maxtt: {word.shape}")

            words.append(word.item())  #(8)

            if word == 1:  #(9)
                break

            features = self.decoder.embedding(word)  #(10)
            print(f"after embeddingtt: {features.shape}n")

        return words       #(11)

Instead of taking two inputs like the previous one, the generate() method takes raw image as the only input (#(1)). Since we want the features extracted from the image to be the initial input token, we first need to process the raw input image with the encoder block prior to actually generating the subsequent tokens (#(2)). Next, we allocate an empty list for storing the token sequence to be produced later (#(3)). The tokens themselves are generated one by one, so we wrap the entire process inside a for loop, which is going to stop iterating once it reaches at most 30 (SEQ_LENGTH) words (#(4)).

The steps done inside the loop are algorithmically similar to the ones we discussed earlier. However, since the LSTM cell here generates a single token at a time, the process requires the tensor to be treated a bit differently from the one passed through the forward() method of the LSTMDecoder class back in Codeblock 6b. The first difference you might notice is the squeeze() operation (#(5)), which is basically just a technical step to be done such that the subsequent layer does the linear projection correctly (#(6)). Then, we take the index of the feature vector having the highest value, which corresponds to the token most likely to come next (#(7)), and append it to the list we allocated earlier (#(8)). The loop is going to break whenever the predicted index is the stop token, which in this case I assume to be the token at index 1 in the vocabulary (#(9)). Otherwise, if the model does not find the stop token, it is going to convert the last predicted word into its 512 (EMBED_DIM)-dimensional vector (#(10)), allowing it to be used as the input features for the next iteration. Lastly, the generated word sequence will be returned once the loop is completed (#(11)).

We are going to simulate the forward pass for the training phase using the Codeblock 9 below. Here I pass two tensors through the show_and_tell model (#(1)), each representing a raw image of size 3×224×224 (#(2)) and a sequence of tokenized words (#(3)). Based on the resulting output, we found that our model works properly as the two input tensors successfully passed through the InceptionEncoder and the LSTMDecoder part of the network.

# Codeblock 9
show_and_tell = ShowAndTell()  #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))      #(3)

captions = show_and_tell(images, captions)
# Codeblock 9 Output
after encoder : torch.Size([1, 512])
after decoder : torch.Size([1, 31, 10000])

Now, let’s assume that our show_and_tell model is already trained on an image captioning dataset, and thus ready to be used for inference. Look at the Codeblock 10 below to see how I do it. Here we set the model to eval() mode (#(1)), initialize the input image (#(2)), and pass it through the model using the generate() method (#(3)).

# Codeblock 10
show_and_tell.eval()  #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)

with torch.no_grad():
    generated_tokens = show_and_tell.generate(images)  #(3)

The flow of the tensor can be seen in the output below. Here I truncate the resulting outputs because it only shows the same token generation process 30 times.

# Codeblock 10 Output
after encoder    : torch.Size([1, 512])

iteration #0
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

iteration #1
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

.
.
.
.

To see what the resulting caption looks like, we can just print out the generated_tokens list as shown below. Keep in mind that this sequence is still in the form of tokenized words. Later, in the post-processing stage, we will need to convert them back to the words corresponding to these numbers.

# Codeblock 11
generated_tokens
# Codeblock 11 Output
[5627,
 3906,
 2370,
 2299,
 4952,
 9933,
 402,
 7775,
 602,
 4414,
 8667,
 6774,
 9345,
 8750,
 3680,
 4458,
 1677,
 5998,
 8572,
 9556,
 7347,
 6780,
 9672,
 2596,
 9218,
 1880,
 4396,
 6168,
 7999,
 454]
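
As a quick illustration of that post-processing step, below is a minimal sketch that maps the generated token ids back to words. The idx2word dictionary here is a hypothetical index-to-word mapping you would have built while constructing the vocabulary; it is not part of the code above.

# A minimal detokenization sketch. `idx2word` is a hypothetical index-to-word
# mapping built during vocabulary construction, not part of the codeblocks above.
def tokens_to_caption(token_ids, idx2word, stop_token_id=1):
    words = []
    for idx in token_ids:
        if idx == stop_token_id:                  # same stop token assumed in generate()
            break
        words.append(idx2word.get(idx, "<unk>"))  # fall back for unknown ids
    return " ".join(words)

# caption = tokens_to_caption(generated_tokens, idx2word)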

Ending

With the above output, we've reached the end of our discussion on image captioning. Over time, many other researchers have attempted to improve on this task. So, I think in the upcoming article I will discuss the state-of-the-art method on this topic.

Thanks for reading, I hope you learn something new today!

_By the way you can also find the code used in this article here._


References

[1] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. Arxiv. https://arxiv.org/pdf/1411.4555 [Accessed November 13, 2024].

[2] Christian Szegedy et al. Going Deeper with Convolutions. Arxiv. https://arxiv.org/pdf/1409.4842 [Accessed November 13, 2024].

[3] Marc Tanti et al. Where to put the Image in an Image Caption Generator. Arxiv. https://arxiv.org/pdf/1703.09137 [Accessed November 13, 2024].

[4] Stepan Ulyanin. Captioning Images with CNN and RNN, using PyTorch. Medium. https://medium.com/@stepanulyanin/captioning-images-with-pytorch-bc592e5fd1a3 [Accessed November 16, 2024].

[5] Saketh Kotamraju. How to Build an Image-Captioning Model in Pytorch. Towards Data Science. https://towardsdatascience.com/how-to-build-an-image-captioning-model-in-pytorch-29b9d8fe2f8c [Accessed November 16, 2024].

[6] Code with Aarohi. Image Captioning using CNN and RNN | Image Captioning using Deep Learning. YouTube. https://www.youtube.com/watch?v=htNmFL2BG34 [Accessed November 16, 2024].

DeepSeek-V3 Explained 1: Multi-head Latent Attention https://towardsdatascience.com/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4/ Fri, 31 Jan 2025 10:02:05 +0000 https://towardsdatascience.com/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4/ Key architecture innovation behind DeepSeek-V2 and DeepSeek-V3 for faster inference

Image created by author using ChatGPT.

This is the first article of our new series "DeepSeek-V3 Explained", where we will try to demystify DeepSeek-V3 [1, 2], the latest model open-sourced by DeepSeek.

In this series, we aim to cover two major topics:

  • Major architecture innovations in DeepSeek-V3, including MLA (Multi-head Latent Attention) [3], DeepSeekMoE [4], auxiliary-loss-free load balancing [5], and multi-token prediction training.
  • Training of DeepSeek-V3, including pre-training, finetuning and RL alignment phases.

This article mainly focuses on Multi-head Latent Attention, which was first proposed in the development of DeepSeek-V2 and then used in DeepSeek-V3 as well.

Table of contents:

  • Background: we start from the standard MHA, explain why we need Key-Value cache at inference stage, how MQA and GQA try to optimize it, and how RoPE works, etc.
  • Multi-head Latent Attention: An in-depth introduction to MLA, including its motivations, why decoupled RoPE is needed, and its performance.
  • References.

Background

To better understand MLA and also make this article self-contained, we will revisit several related concepts in this section before diving into the details of MLA.

MHA in Decoder-only Transformers

Note that MLA was developed to speed up inference in autoregressive text generation, so the MHA we are talking about in this context is for the decoder-only Transformer.

The figure below compares three Transformer architectures used for decoding, where (a) shows both the encoder and decoder proposed in the original "Attention is All You Need" paper. Its decoder part is then simplified by [7], leading to the decoder-only Transformer model shown in (b), which is later used in many generation models like GPT [8].

Nowadays, LLMs more commonly choose the structure shown in (c) for more stable training, with normalization applied to the input rather than the output, and LayerNorm upgraded to RMS Norm. This will serve as the baseline architecture we will discuss in this article.

Figure 1. Transformer architectures. (a) encoder-decoder proposed in [6]. (b) Decoder-only Transformer proposed in [7] and used in GPT [8]. (c) An optimized version of (b) with RMS Norm before attention. [3]

Within this context, MHA calculation largely follows the process in [6], as shown in the figure below:

Figure 2. Scaled dot-product attention vs. Multi-Head Attention. Image from [6].

Assume we have n_h attention heads, and the dimension of each attention head is represented as d_h, so that the concatenated dimension will be (n_h · d_h).

Given a model with l layers, if we denote the input for the t-th token in that layer as h_t with dimension d, we need to map the dimension of h_t from d to (n_h · d_h) using the linear mapping matrices.

More formally, we have (equations from [3]):

where W^Q, W^K and W^V are the linear mapping matrices:

After such mapping, q_t, k_t and v_t will be split into n_h heads to calculate the scaled dot-product attention:

where W^O is another projection matrix to map the dimension inversely from (n_h · d_h) back to d:

Note that the process described by Eqn.(1) to (8) above is just for a single token. During inference, we need to repeat this process for each newly generated token, which involves a lot of repeated calculation. This leads to a technique called Key-Value cache.

Key-Value Cache

As suggested by its name, Key-Value cache is a technique designed to speed up the autoregressive process by caching and reusing the previous keys and values, rather than re-computing them at each decoding step.

Note that KV cache is typically used only during the inference stage, since in training we still need to process the entire input sequence in parallel.

KV cache is commonly implemented as a rolling buffer. At each decoding step, only the new query Q is computed, while the K and V stored in the cache will be reused, so that the attention will be computed using the new Q and reused K, V. Meanwhile, the new token’s K and V will also be appended to the cache for later use.
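
Below is a minimal sketch of this idea for a single attention head, just to make the caching mechanics concrete; it is illustrative only and not tied to any particular library's implementation.

# A minimal KV-cache sketch for one attention head; tensor shapes are
# (batch, seq_len, d). Illustrative only, not a production implementation.
import torch

class KVCache:
    def __init__(self):
        self.k = None   # keys cached from previous decoding steps
        self.v = None   # values cached from previous decoding steps

    def update(self, k_new, v_new):
        # append the new token's key/value along the sequence dimension
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v

# at each decoding step, q/k/v are computed only for the new token:
d = 64
cache = KVCache()
q_new = torch.randn(1, 1, d)
k_all, v_all = cache.update(torch.randn(1, 1, d), torch.randn(1, 1, d))
attn_out = torch.softmax(q_new @ k_all.transpose(-2, -1) / d**0.5, dim=-1) @ v_all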

However, the speedup achieved by KV cache comes at a cost of memory, since KV cache often scales with batch size × sequence length × hidden size × number of heads, leading to a memory bottleneck when we have larger batch size or longer sequences.

That further leads to two techniques aiming at addressing this limitation: Multi-Query Attention and Grouped-Query Attention.

Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)

The figure below shows the comparison between the original MHA, Grouped-Query Attention (GQA) [10] and Multi-Query Attention (MQA) [9].

Figure 3. MHA [6], GQA [10] AND MQA [9]. Image from [10].

The basic idea of MQA is to share a single key and a single value head across all query heads, which can significantly reduce memory usage but will also impact the accuracy of attention.

GQA can be seen as an interpolating method between MHA and MQA, where a single pair of key and value heads will be shared only by a group of query heads, not all queries. But still this will lead to inferior results compared to MHA.
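
To make the head-sharing idea concrete, here is a minimal sketch of GQA-style attention with illustrative shapes; setting the number of KV heads to 1 would correspond to MQA.

# A minimal GQA sketch: 8 query heads share 2 KV heads (4 queries per group).
# Shapes are illustrative only (batch=1, seq_len=8, head dim=64).
import torch

n_q_heads, n_kv_heads, seq_len, d_h = 8, 2, 8, 64
q = torch.randn(1, n_q_heads, seq_len, d_h)
k = torch.randn(1, n_kv_heads, seq_len, d_h)
v = torch.randn(1, n_kv_heads, seq_len, d_h)

# repeat each KV head so it lines up with its group of query heads
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / d_h**0.5, dim=-1) @ v
print(attn.shape)   # torch.Size([1, 8, 8, 64]); only the 2 KV heads need caching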

In the later sections, we will see how MLA manages to seek a balance between memory efficiency and modeling accuracy.

RoPE (Rotary Positional Embeddings)

One last piece of background we need to mention is RoPE [11], which encodes positional information directly into the attention mechanism by rotating the query and key vectors in multi-head attention using sinusoidal functions.

More specifically, RoPE applies a position-dependent rotation matrix to the query and key vectors at each token, and uses sine and cosine functions for its basis but applies them in a unique way to achieve rotation.

To see what makes it position-dependent, consider a toy embedding vector with only 4 elements, i.e., (x_1, x_2, x_3, x_4).

To apply RoPE, we firstly group consecutive dimensions into pairs:

  • (x_1, x_2) -> position 1
  • (x_3, x_4) -> position 2

Then, we apply a rotation matrix to rotate each pair:

Figure 4. Illustration of the rotation matrix applied to a pair of tokens. Image by author.

where θ = θ(p) = p ⋅ θ_0​, and θ_0​ is a base frequency. In our 4-d toy example, this means that (x_1, x_2) will be rotated by θ_0​, and (x_3, x_4) will be rotated by 2 ⋅ θ_0.

This is why we call this rotation matrix position-dependent: at each position (or each pair), we apply a different rotation matrix whose rotation angle is determined by the position.
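
The snippet below sketches exactly this 4-d toy example; it is not a full RoPE implementation (in practice the rotation angle also depends on the token's position in the sequence and on per-pair base frequencies), but it shows the pairwise rotation mechanics.

# A sketch of the 4-d toy example: rotate pair (x1, x2) by theta_0 and
# pair (x3, x4) by 2*theta_0. Not a full RoPE implementation.
import torch

def toy_rope(x, theta_0=0.1):
    out = x.clone()
    for p in range(x.shape[0] // 2):             # p = 0, 1 -> pairs 1 and 2
        theta = torch.tensor((p + 1) * theta_0)  # position-dependent angle
        cos, sin = torch.cos(theta), torch.sin(theta)
        x1, x2 = x[2 * p], x[2 * p + 1]
        out[2 * p]     = x1 * cos - x2 * sin     # standard 2-D rotation
        out[2 * p + 1] = x1 * sin + x2 * cos
    return out

print(toy_rope(torch.tensor([1.0, 0.0, 1.0, 0.0])))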

RoPE is widely used in modern LLMs due to its efficiency in encoding long sequences, but as we can see from the above formula, it applies position-dependent rotations to both Q and K, which makes it incompatible with MLA in some ways.


Multi-head Latent Attention

Finally we can move on to the MLA part. In this section we will first lay out the high-level idea of MLA, and then dive deeper into why it needs to modify RoPE. Finally, we present the detailed algorithm of MLA as well as its performance.

MLA: High-level Idea

The basic idea of MLA is to compress the attention input h_t into a low-dimensional latent vector with dimension d_c, where d_c is much lower than the original (n_h · d_h). Later when we need to calculate attention, we can map this latent vector back to the high-dimensional space to recover the keys and values. As a result, only the latent vector needs to be stored, leading to significant memory reduction.

This process can be more formally described with the following equations, where c^{KV}_t is the latent vector, W^{DKV} is the compressing matrix that maps h_t's dimension from (n_h · d_h) to d_c (here D in the superscript stands for "down-projection", meaning compressing the dimension), while W^{UK} and W^{UV} are both up-projection matrices that map the shared latent vector back to the high-dimensional space.

Similarly, we can also map the queries into a latent, low-dimensional vector and then map it back to the original, high-dimensional space:
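
Below is a minimal sketch of this low-rank compression idea for the keys and values, ignoring the decoupled RoPE part discussed in the next section. The dimension values are illustrative only, not the ones used in DeepSeek-V2/V3.

# A minimal sketch of MLA's KV compression (decoupled RoPE omitted).
# Dimension values are illustrative, not those of DeepSeek-V2/V3.
import torch
import torch.nn as nn

d_model = 4096      # dimension of the attention input h_t
d_c     = 512       # latent dimension, much smaller than d_model

W_DKV = nn.Linear(d_model, d_c, bias=False)   # down-projection (compression)
W_UK  = nn.Linear(d_c, d_model, bias=False)   # up-projection for keys
W_UV  = nn.Linear(d_c, d_model, bias=False)   # up-projection for values

h_t  = torch.randn(1, d_model)   # hidden state of one token
c_kv = W_DKV(h_t)                # only this latent vector needs to be cached
k_t  = W_UK(c_kv)                # keys recovered at attention time
v_t  = W_UV(c_kv)                # values recovered at attention time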

Why Decoupled RoPE is Needed

As we mentioned before, RoPE is a common choice for training generation models to handle long sequences. If we directly apply the above MLA strategy, that will be incompatible with RoPE.

To see this more clearly, consider what happens when we calculate attention using Eqn. (7): when we multiply the transposed q with k, the matrices W^Q and W^{UK} will appear in the middle, and their combination is equivalent to a single mapping from d_c to d.

In the original paper [3], the authors describe this as W^{UK} being "absorbed" into W^Q; as a result, we do not need to store W^{UK} in the cache, further reducing memory usage.

However, this will not be the case when we take the rotation matrix in Figure (4) into consideration, as RoPE will apply a rotation matrix on the left of W^{UK}, and this rotation matrix will end up in between the transposed W^Q and W^{UK}.

As we have explained in the background part, this rotation matrix is position-dependent, meaning the rotation matrix for each position is different. As a result, W^{UK} can no longer be absorbed by W^Q.

To address this conflict, the authors propose what they call "decoupled RoPE": they introduce additional query vectors along with a shared key vector, and use these additional vectors in the RoPE process only, while keeping the original keys isolated from the rotation matrix.

The entire process of MLA can be summarized as below (equation numbers reused from Appendix C in [3]):

Figure 5. MLA process. Image edited by author based on equations in [3].

where

  • Eqn. (37) to (40) describe how to process query tokens.
  • Eqn. (41) and (42) describe how to process key tokens.
  • Eqn. (43) and (44) describe how to use the additional shared key for RoPE, be aware that the output of (42) is not involved in RoPE.
  • Eqn. (45) describes how to process value tokens.

During this process, only the blue variables with boxes need to be cached. This process can be illustrated more clearly with the flowchart below:

Figure 6. Flowchart of MLA. Image from [3].

Performance of MLA

The table below compares the number of elements needed for the KV cache (per token) as well as the modeling capacity of MHA, GQA, MQA and MLA, demonstrating that MLA can indeed achieve a better balance between memory efficiency and modeling capacity.

Interestingly, the modeling capacity of MLA even surpasses that of the original MHA.

Table 1 from [3].

More specifically, the table below shows the performance of MHA, GQA and MQA on 7B models, where MHA significantly outperforms both MQA and GQA.

Table 8 from [3].

The authors of [3] also conducted a comparison between MHA and MLA, with results summarized in the table below, where MLA achieves better results overall.

Table 9 in [3].

References

The Three Phases of Learning Machine Learning https://towardsdatascience.com/the-three-phases-of-learning-machine-learning-df0a53148dd3/ Tue, 28 Jan 2025 13:02:11 +0000 https://towardsdatascience.com/the-three-phases-of-learning-machine-learning-df0a53148dd3/ Part One: The beginner phase

After 6+ years of experience with machine learning, both in research and industry, it’s very interesting to see how far the field has gotten over the years. I still remember sitting in seminar rooms, listening to lectures on everything ML: deep learning, reinforcement learning, random forests, neural networks, natural language processing, …

Photo by Vlad Bagacian on Unsplash

One particular NLP lecture stands out in my memory, where we discussed the rapid advancements in the field. We had discussed vanilla attention mechanisms the week before and were now looking at Transformer-based approaches. The great tutor showed us graphs with the parameter counts of the models. We then looked up the then-recent advances, and it became clear: any such figure is outdated within a month. That was when my ML journey had barely started, and already the amount of innovation being published again and again was insane.

Since then, my journey has proceeded in several phases. From exchanges with other ML people, it appears that they have had similar experiences: from beginner to seasoned practitioner, the journey can be divided into the following three phases:

  • Phase I: Beginner (~1 year; this article)
  • Phase II: Intermediate (~1 to 3 years; upcoming)
  • Phase III: Advanced (~5+ years; upcoming)

The beginner phase of learning Machine Learning

This article’s focus is on the first phase in your machine learning journey: the beginner phase. To help you start in ML, you can use the checklist below to track your progress (also see section Resources).

The beginner phase is everyone’s starting stage. At this stage, your primary goal is to understand the fundamental ideas behind machine learning: how machines learn from data and the essential algorithms that power ML models. You’ll dive into topics like decision trees, k-nearest neighbors (KNN), and linear regression. These concepts serve as the building blocks for your further ML journey.

At the start of your journey, it probably makes sense to get an overview of the field of Deep Learning. Here, I can recommend MIT's free Introduction to Deep Learning course, see Resources. It is an intense lecture showcasing all areas of (modern) deep machine learning. It allows you to see what's awaiting you over the next year(s).

Practically speaking, we can divide the beginner phase into five categories.

  1. Data Handling
  2. Classic Machine Learning
  3. Neural Networks
  4. Theory
  5. Miscellaneous Skills

Data Handling

In this category, you’ll learn how to work with small datasets that fit easily into your computer’s memory. This is good, as you do not need to rent expensive compute online. Often, these datasets are readily available within ML frameworks like PyTorch or TensorFlow. Common examples of small-sized datasets include:

  • Images: MNIST (handwritten digits)
  • Audio: ESC-50 environmental sound recordings
  • Time-series: Heartbeat timeseries categorization
  • Text: IMDB movie reviews for sentiment analysis

The advantage of these beginner datasets is that they are well-documented and easy to manipulate. Usually, the preprocessing you have to do for them is just resizing images or trimming sentences, which you can handle with just a few lines of code. This lets you focus on learning the core concepts without having to preprocess larger-than-memory datasets.
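
As a minimal example of how little code such a dataset requires, here is a sketch of loading MNIST with torchvision (the transform simply converts images to tensors):

# A minimal sketch of loading MNIST with torchvision.
import torchvision
import torchvision.transforms as transforms

train_set = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=transforms.ToTensor()
)
print(len(train_set), train_set[0][0].shape)  # 60000 torch.Size([1, 28, 28])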

Should you ever need more compute, then try Google Colab. It's a free offering by Google that lets you run Jupyter Notebooks (i.e., interactive Python code) directly in your browser. See Resources.

Classic Machine Learning

While the focus of this post is on deep machine learning – as opposed to classic machine learning – it’s useful to learn the time-proven basics as well. Among these classic ML techniques, a few stand out:

  • Regression: The goal here is to predict continuous values based on input data. The Boston House pricing dataset is an often-used dataset for regression; the task is to predict house prices based on features like square footage, location, etc.
  • Clustering: This is about grouping similar data points together. K-means clustering is a great entry point for beginners, and I recently read a paper where the authors combined advanced deep learning techniques with k-means clustering.
  • Support Vector Machines (SVM): SVMs focus on finding the best decision boundary (hyperplane, essentially a border in n-dimensional space) to separate data into distinct classes. They are often used for classification datasets.

In general, though most of the recent advancements are the result of deep learning techniques, classic machine learning is still relevant. You do not always need advanced ML techniques; sometimes the basics suffice.
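
To give a feeling for how compact these classic methods are in practice, below is a minimal scikit-learn sketch on a tiny synthetic dataset (purely illustrative):

# A minimal sketch of classic ML with scikit-learn on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # clustering
svm = SVC(kernel="linear").fit(X, y)                             # classification

print(kmeans.labels_[:5], svm.score(X, y))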

Neural Networks

Once you’re comfortable with the basics of classic ML, you’ll begin transitioning to neural networks (NNs), the foundation of deep learning. I recommend starting with dense neural networks, which consist of multiple fully connected layers, or linear layers in PyTorch. In these layers, each input is multiplied by a weight and combined with a bias to produce an output. You can use these networks for small and mid-sized datasets alike.

After dense neural networks, you'll dive into convolutional neural networks (CNNs). This type of network is essential for tasks involving image data. CNNs use convolution operations to identify features like edges, textures, or shapes in images, regardless of their position in the image. This location-independent ability makes CNNs a very versatile approach.

Lastly, for sequential data (such as time series) I recommend playing with recurrent neural networks (RNNs). RNNs are designed to retain information from previous steps, making them ideal for tasks like language modeling or time-series forecasting. In fact, earlier machine learning research (around 2014) primarily used RNNs for machine translation, see Resources.
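
As a starting point, here is a minimal sketch of a dense network in PyTorch for MNIST-sized inputs (layer sizes are illustrative only):

# A minimal dense (fully connected) network sketch; layer sizes are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),          # 1x28x28 image -> 784-dim vector
    nn.Linear(784, 128),   # fully connected (linear) layer
    nn.ReLU(),
    nn.Linear(128, 10),    # 10 output classes
)

x = torch.randn(1, 1, 28, 28)
print(model(x).shape)      # torch.Size([1, 10])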

Theory

Together with hands-on learning, it’s essential to develop a theoretical understanding of the methods you’re using. This includes mathematical notation, matrix operations, and the underlying principles of machine learning algorithms.

For instance, understanding Σ notation (summation) helps make complex mathematical expressions more concise and readable. Though it requires some learning, expressing operations this way helps avoid ambiguities, a downside of communicating in natural language. With practice, the math behind algorithms will begin to make more sense.

At this stage, you’ll also encounter key metrics used to evaluate ML models, such as accuracy, mean squared error, precision, and recall. These metrics will help you measure your models’ performance.

Miscellaneous Skills

As you are studying classic machine learning, neural networks, and the theory, you will naturally also develop practical skills in the beginner phase. These skills include programming, model management, and virtual environments.

For programming, Python is the go-to language for ML, and its large ecosystem of libraries (like NumPy, pandas, Scikit-learn, etc.) will be invaluable.

The features these libraries provide come in handy once you analyze your dataset. For example, what’s the value range? Are there some outliers? How extreme are they? Are there unusable data samples? These are some questions that you want to answer, which usually happens on the go:

You write a short code snippet – it breaks because a data sample is different – you update the code – you have learned more about your data.

Basically any time you are honing your data analysis skills, you will be improving your coding skills as well.

Model management, on the other hand, means saving, loading, and sharing trained models. This is relevant when you are training your neural network in one script but evaluate it in a separate one.
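
A minimal sketch of this pattern with PyTorch's state_dict (a tiny nn.Linear stands in for the real network):

# A minimal save/load sketch; a tiny nn.Linear stands in for the real model.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")       # in the training script

restored = nn.Linear(4, 2)                       # in the evaluation script
restored.load_state_dict(torch.load("model.pt"))
restored.eval()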

Lastly, everything you do on the code level requires a Python interpreter, a program that helps you execute your Python code. As you will have several projects, I recommend getting familiar with virtual environments. These "encapsulate" all the requirements and dependencies into their own disk space, so that your projects do not interfere with each other.


After the Beginner Stage

By the end of the beginner phase, you’ll have a solid understanding of core machine learning concepts, hands-on experience with basic models, and can read the mathematical formulation. Through your focus on (small) projects, you have practiced a variety of ML methods. This is a good foundation to tackle more complex topics in the next phase, the intermediate phase. It covers years 1 to 3 of your ML journey, and is currently in preparation. Come back here later or follow me or Towards Data Science to get notified when it is published.


Further Reading:

How I’d Learn Machine Learning Again, After 6 Years

Learning ML or Learning About Learning ML?

A checklist to track your Machine Learning progress

Resources:

Understanding the Evolution of ChatGPT: Part 3- Insights from Codex and InstructGPT https://towardsdatascience.com/understanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7/ Tue, 21 Jan 2025 18:19:27 +0000 https://towardsdatascience.com/understanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7/ Mastering the art of fine-tuning: Learnings for training your own LLMs.

Understanding the Evolution of ChatGPT: Part 3— Insights from Codex and InstructGPT
(Image from Unsplash)

This is the third article in our GPT series, and also the most practical one: finally, we will talk about how to effectively fine-tune LLMs.

It is practical in the sense that, if you were asked to train your own LLM today, you could skip pre-training and jump straight into using an open-source LLM or SLM; however, very likely you would still need to finetune it a bit on your own data and task, and that is where this article can come to help.

More specifically, we will focus on two finetuned models – Codex and InstructGPT, as they represent addressing two types of challenges in LLM finetuning:

  • Codex needs to adapt a pretrained LLM to a different modality (code scripts), as programming languages have many characteristics that differ from natural language;
  • InstructGPT aims to make the model more aligned with human preferences, which cannot be achieved automatically by traditional language modeling objectives.

As we will see later, both challenges demand creativity and carefulness at every stage of the finetuning process: how to collect high-quality data, how to modify model architectures, how to effectively initialize your model, how to determine a proper objective, and how to properly evaluate it.

Below is the outline for this article:

  • Overview: why we need finetuning and what makes it so challenging; GPT3.5 and its finetuned versions.
  • Codex: how to evaluate code generation properly, how to collect data and how to adapt the model to process programming languages.
  • InstructGPT and ChatGPT: how to evaluate alignment, why RLHF works, and how it is implemented in InstructGPT.
  • Summary: best practices in LLM finetuning.

Below are the links to our previous articles if you are interested:

  • Part 1: An In-Depth Look at GPT-1 and What Inspired It: where we cover the pre-training plus finetuning paradigm and its evolution from CV to NLP, previous pre-training efforts such as Word2Vec and GloVe, decoder-only Transformers, auto-regressive vs. auto-encoding LM, and key innovations of GPT-1.
  • Part 2: GPT-2 and GPT-3: where we cover how GPT models were scaled up from 117M to 175B, under the philosophy of exploring task-agnostic pre-training via scaling hypothesis and in-context learning.

Overview

As we explained in our second article, both GPT-2 and GPT-3 can be considered as OpenAI’s experiments to test the potential of task-agnostic pre-training. While doing so, the authors also mentioned finetuning as a promising direction for future studies, as it might help the model to further improve its performance on certain tasks.

Why is Finetuning Needed?

The reasons are three-fold.

The first reason is of course performance. Pre-trained models are more like generalists that can perform a wide range of tasks reasonably well, but still they might struggle to beat the specialists trained on a particular task. If our goal is to have such a specialized model to help us on a very specific task, then finetuning should be definitely considered.

Another reason is that, albeit being generally powerful, GPT-3 models are not always reliable in following human instructions, especially when those instructions become complex. This is because, as the authors explained in the InstructGPT paper, the pre-training objective focuses mainly on language modeling, i.e. predicting the next token, and such capabilities do not automatically translate into instruction-following. Thus, some special finetuning strategies are needed.

There are also concerns about safety and ethical aspects, for very similar reasons: auto-regressive language modeling alone is not sufficient to prevent the model from generating harmful or biased answers. For that issue, finetuning can also enable us to better control the generation process.

Challenges in Finetuning

Broadly speaking, there are two types of challenges in finetuning LLMs: the need to adapt to a new modality, and the need to align the model with human preferences.

Take Codex as an example of the former case, where the pre-trained model needs to be applied to a different modality with its own unique characteristics: to process code scripts, it needs to understand the basic syntax of a specific programming language, handle static and dynamic types (and even infer types), and correctly handle indentation in languages like Python.

The latter case is more tricky in some ways, as "alignment" itself is a pretty vague and controversial concept, and it has to be defined more clearly and translated into a set of measurable aspects before we can actually finetune towards that goal. Moreover, even if we have worked out a definition of alignment, achieving that goal is also non-trivial, as there are no ready-to-use training objectives directly connected to it.

On top of that, we also need to collect high-quality domain-specific training data and rethink the evaluation process, including the evaluation dataset as well as the evaluation metrics to use.

In later sections, we will see how Codex and InstructGPT handled these issues. In particular, we will highlight how they implemented every step with both creativity and carefulness, from which anyone who wants to finetune his or her own LLM can learn something.

GPT-3.5

The GPT-3.5 series typically refers to the model series finetuned on top of GPT-3, including the following variants (see wiki):

  • code-davinci-002: a version of Codex.
  • text-davinci-002: a transitional model from GPT-3 to InstructGPT.
  • text-davinci-003: more similar to InstructGPT.

Overall, GPT-3.5 can be considered as finetuned GPT-3 with enhanced instruction following, better generation quality, and better steerability. It is the foundation for several other models including ChatGPT, Codex, Whisper and the text model of DALL-E 2, which demonstrates the potential of effectively finetuning LLMs on specialized tasks.

In the following sections, we will dive deeper into Codex and InstructGPT. Rather than covering every detail of their finetuning process, we will mainly focus on the aspects that best showcase the importance of creativity and carefulness.


Codex

The Codex model was released in 2021 and is specialized in Python code-writing.

Below are a few aspects that we want to highlight.

Evaluation of Code Generation

When building a model for a new task, the first thing that often comes to mind is how to evaluate that task properly.

This is important because, without an effective evaluation protocol, we cannot determine if we are really making any progress, and sometimes we cannot even identify the gaps in our current model in the first place.

In the case of Codex, the authors first realized that standard match-based metrics such as BLEU score are not suitable for measuring code generation performance.

In case you are not familiar with BLEU score: it is widely used for evaluating text generation tasks such as machine translation, by comparing overlapping phrases and calculating a precision score, while also considering text length to ensure balance.

However, the same coding problem might be solved with different data structures or algorithms. For example, generating a Fibonacci sequence can be implemented by either a top-down or bottom-up DP algorithm, resulting in very different code scripts:

def fib_top_down(n, memo={}):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_top_down(n-1, memo) + fib_top_down(n-2, memo)
    return memo[n]

def fib_bottom_up(n):
    if n <= 1:
        return n
    dp = [0] * (n + 1)
    dp[0], dp[1] = 0, 1
    for i in range(2, n + 1):
        dp[i] = dp[i-1] + dp[i-2]
    return dp[n]

In that case, if we evaluate both solutions against a given reference solution using BLEU score, it is very likely that one or even both solutions will have very low BLEU scores, even though both solutions are correct.

An alternative is to evaluate by what the authors called "functional correctness", for example the pass@k metric used by Kulal et al., where for each problem we generate k code samples and test each of them, and a problem is considered solved if any sample passes the unit tests. In the end, the total fraction of problems solved is reported. However, as the authors pointed out, calculating pass@k with this definition will result in high variance due to randomness in this process, especially when k is small.

To mitigate this issue, the authors propose another way to estimate pass@k: instead of generating k samples directly, they generate n ≥ k samples per task. As more samples are generated and tested, the estimation process becomes more reliable even if k is small. Then, based on how many samples are correct (assume c samples pass the unit tests), an unbiased estimator can be computed as below; a short sketch of this computation follows the bullet points.

Figure 1. Left: the optimized pass@k definition. right: a numerically stable script to calculate pass@k. (image from Codex paper.)

where

  • C(n, k) is the number of ways to choose k samples out of n;
  • C(n-c, k) is the number of ways to choose k samples out of the (n-c) incorrect samples;
  • Thus, C(n-c, k)/C(n, k) represents the probability that all chosen samples are incorrect;
  • Finally, 1 – C(n-c, k)/C(n, k) represents the probability that at least one sample is correct.
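To make this concrete, below is a minimal Python sketch of the estimator (my own rendering of the product form shown in Figure 1, not the verbatim script from the paper):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of pass@k, given n generated samples of which c are correct.
    # Computes 1 - C(n-c, k) / C(n, k) as a running product for numerical stability.
    if n - c < k:
        return 1.0  # every subset of size k must contain at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=20, k=1))   # 0.1: with 10% correct samples, pass@1 is 10%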

To further prove that optimizing for BLEU score is not equivalent to optimizing for functional correctness, the authors also plot the BLEU score densities for correct (blue) and wrong (green) solutions for 4 random coding problems, where the distributions are clearly not separable:

Figure 2. BLEU score probability density for correct (blue) and wrong (green) solutions for 4 random problems. (Image from Codex paper.)

Beyond optimizing for the evaluation metric, the authors also built a new dataset called HumanEval, which contains 164 hand-written programming problems. As shown in the example below, each problem includes a function signature, a docstring, a body and an average of 7.7 unit tests:

Figure 3. Example problems from the HumanEval dataset. (Image from Codex paper.)

Note that, as the authors mention in the paper, it is important for these tasks to be hand-written, since otherwise the evaluation problems might overlap with those seen during training. Also, to ensure that the testing process does not pose any risks from malicious code, the authors created a sandbox to execute the generated scripts.

Training Data Collection

Moving to the training part, the first question is how to collect high-quality training data. For code generation, the good news is that we can leverage the vast amount of code repositories from GitHub, but still some data cleaning strategies are needed, as the paper mentioned:

We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters.

Note that most of these cleaning strategies are specialized to programming languages, so we might need to come up with other ideas when cleaning our own data.

Adaptations in Finetuning

The most important adaptation is for the tokenizer, due to the obvious reason that the distribution of words in GitHub code differs a lot from that of natural language. In the Codex paper, the authors noted that this is especially the case when encoding whitespaces, making the original GPT-3 tokenizer less effective.

To fix that issue, an additional set of tokens was added to the vocabulary to represent whitespace runs of different lengths. As mentioned in the paper, this simple modification enables representing code with 30% fewer tokens.

So, if our model needs to handle an input corpus whose distribution differs from that of natural language, we might need to study that distribution and modify the tokenizer accordingly (see the sketch below).
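As a rough illustration of this kind of tokenizer surgery (a sketch using Hugging Face transformers with placeholder choices, not the actual Codex tokenizer), one could register whitespace-run tokens like this:

from tokenizers import AddedToken
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Hypothetical extra tokens: runs of 2 to 8 spaces, each treated as a single token.
whitespace_tokens = [
    AddedToken(" " * n, lstrip=False, rstrip=False, normalized=False)
    for n in range(2, 9)
]
num_added = tokenizer.add_tokens(whitespace_tokens)
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

# Any model finetuned with this tokenizer must have its embedding matrix resized, e.g.:
# model.resize_token_embeddings(len(tokenizer))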

Findings in Evaluation

Firstly, the figure below shows the pass rates of different models on the HumanEval dataset. Overall, all the Codex variants show significantly better performance compared to GPT-3, where

  • Codex (finetuned on code) solves 28% of the problems;
  • Codex-S (finetuned on standalone functions) solves 37.7%;
  • Codex-S with generating 100 samples and selecting the one with the highest mean log-probability solves 44.5%;
  • Codex-S with an oracle that selects the sample passing the unit tests solves an amazing 77.5% of the problems.
Figure 4. Codex pass rates. (Image from Codex paper.)

Plus, a scaling law similar to that of GPT-3 is also observed, suggesting better performance can be achieved with even larger models:

Figure 5. Test loss vs. number of parameters. (Image from Codex paper.)

The authors also noticed that higher sampling temperatures are preferred for larger k, highlighting the importance of careful hyper-parameter tuning:

Figure 6. Higher temperatures are preferred for larger k. (Image from Codex paper.)

InstructGPT and ChatGPT

Evaluation of Alignment

How to properly evaluate "alignment" is also challenging, as the definition of alignment is not as clear-cut as other aspects such as accuracy. In this work the authors define alignment as whether the models are "helpful, honest, and harmless" and convert these into more measurable properties:

  • Helpful: by measuring if the model could follow instructions and even infer intentions from a few-shot prompt.
  • Honest: by measuring truthfulness, or in the author’s words, "if the model’s statements about the world are true". More specifically, they propose to measure it by hallucination rate on the TruthfulQA dataset.
  • Harmless: by measuring "if an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content", and benchmarking on datasets designed to measure bias and toxicity.

On top of that, to make sure the finetuning process does not cause severe regressions on pre-training performance, the evaluation process also needs to reflect quality on both the pre-training and finetuning objectives. For that reason, InstructGPT was evaluated on two separate datasets:

  • Evaluations on API distribution: this is mainly for evaluating the finetuning quality, by asking human labelers to rate which output is preferred;
  • Evaluations on public NLP datasets: this evaluates both the pre-training and finetuning quality, including traditional NLP datasets as well as datasets for evaluating model safety like truthfulness, toxicity and bias.

Next, we will briefly explain how RLHF works and how it is implemented in InstructGPT.

RLHF (Reinforcement Learning from Human Feedback)

The figure below shows the 5 elements in a typical Reinforcement Learning scenario:

Figure 7. Five elements in RL: Agent, Environment, Reward, State and Action. (Image from wiki.)

Now imagine you are teaching your puppy to sit, where you can find all the 5 elements:

  • Agent: Your puppy learning this new command "sit".
  • Environment: Everything around your puppy.
  • State: The situation your puppy is in (whether it is sitting or not).
  • Reward: A treat that you give your puppy when it follows your command;
  • Action: What your puppy could do, like sitting, jumping or barking.

Reinforcement Learning works like this: in the beginning your puppy (the agent) doesn't understand what "sit" means, but it will try different things like running, sitting or even barking (actions) in your house (environment). Every time it sits, it gets a treat (reward). Over time your puppy learns that sitting earns a treat, and it appears as if it finally understands "sit".

Training a model with RL follows a very similar trial-and-error approach. The key to RL is having a well-designed reward. This reward must be closely aligned with the goal; otherwise the agent will not be able to learn the desired behaviors. Meanwhile, producing such a reward should be as easy and quick as possible, since if it is too slow or too complicated to calculate the reward, the RL process will also become extremely slow, making it less useful in practical tasks.

For example, in a game, every action the agent takes will automatically get a score from the environment, and this score is directly connected to your agent’s performance in playing this game.

However, in many real-world applications, there is no ready-to-use reward like a score in a game. Instead, researchers have to put great effort into defining a proper reward function. Moreover, some desired behaviors are very difficult to translate into reward functions – for example, how could you define a reward function to guide the agent to answer questions more politely?

This leads to RLHF: Reinforcement Learning from Human Feedback.

Again in the puppy training example, imagine your puppy finally learns to sit, but sometimes it also barks while sitting, or it will jump onto the couch first instead of sitting quietly on the floor.

What can you do in that case?

With RLHF, you don’t just give your puppy a treat every time it sits. Instead, you give treats by comparing its behaviors. For example, if the puppy sits quietly on the floor, it gets a bigger reward than if it sits while barking or after jumping onto the couch. This way, your puppy learns that sitting quietly on the floor is better, even though you didn’t explicitly explain what "quiet" means.

As we mentioned before, having an easy and fast reward is the key to RL, which makes it unrealistic to involve a human in the training loop to provide direct feedback. To overcome this issue, we can collect some human feedback first, and then use this feedback to learn a reward function that mimics human preferences when comparing two actions.

In summary, RLHF typically involves three stages:

  • Collect human feedback: sample model outputs and ask human judges to compare which one is better.
  • Learn a reward model that mimics the human judges' preferences.
  • Train a better policy using the learnt reward model in the RL process.

In case you are not familiar with RL terminology: a policy refers to the agent’s strategy to choose actions based on the state of the environment.

Next we will cover how this RLHF approach is implemented in finetuning InstructGPT.

Implementation of RLHF in InstructGPT

InstructGPT and ChatGPT were trained with the same methods (see this blog), with RLHF being the key element in finetuning.

The training process largely follows the steps we introduced in the previous section, with special care taken on data quality and implementation details, which, in my opinion, are equally important in making InstructGPT such a success.

Now let me break it down.

Figure 8. An illustration of the RLHF steps in training InstructGPT/ChatGPT. (image from InstructGPT paper.)

Step 1: Collect demonstration data and train a supervised policy

In this step, human labelers were asked to provide high-quality demonstrations of the desired behavior for each prompt.

Prompt dataset: To begin with, you need to have a prompt dataset from which you can sample individual prompts, and ideally that prompt dataset should be both useful and diverse.

To do that, the authors took an iterative approach: in the very beginning, labelers were asked to manually write some seed prompts, and these data were used to train a model via supervised learning. This model was later deployed to the OpenAI API to collect text prompts from users, which later formed the prompt dataset.

The table below shows the distribution of this prompt dataset, as diversity is very important in making sure the model will be trained on various tasks:

Human data collection: human data are needed in three components throughout the RLHF process, including writing demonstrations in Step 1, providing comparison data in Step 2, and conducting final evaluations after finetuning.

In the paper the authors mentioned many practices to ensure data quality:

  • Firstly, high-quality data come from good labelers. To ensure their ability in data labeling, a screening test was conducted to select labelers who were "sensitive to the preferences of different demographic groups, and were good at identifying outputs that were potentially harmful".
  • Secondly, to ensure consistency between all the labelers, an onboarding process was set up to train all labelers, and detailed instructions for each task were provided. The authors also mention that they set up a shared chat room to answer questions from labelers.
  • Finally, to see how the model generalizes to the preferences of different labelers, a separate group of labelers who didn't go through the screening test was hired for evaluation.

Based on these human demonstration data, a pretrained GPT-3 model was finetuned using supervised learning in the first step. This model is referred to as the baseline policy, which will be used to produce comparison outputs in Step 2 and initialize the PPO algorithm in Step 3.

Step 2: Collect comparison data and train a reward model

Comparison data collection: Once the baseline policy is available, it is used to generate outputs for some sampled prompts, and these outputs are reviewed and ranked by human labelers from best to worst. To speed up this ranking process, a set of K outputs is shown to the human labelers simultaneously, where K ranges from 4 to 9.

Reward model training: The reward model was initialized from the supervised baseline policy by removing the final unembedding layer and training it on the comparison data to assign a scalar score to each prompt-response pair; a 6B-parameter model was used. In particular, the authors mention that training on all comparisons from each prompt as a single batch, rather than shuffling the comparisons, helps alleviate overfitting. A minimal sketch of the pairwise objective follows below. Note that we need to strike a balance when deciding the size of this reward model: it needs to be sufficiently large to accurately mimic human preferences, but it cannot be too large since it needs to support fast inference during the RL process.
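As a rough sketch of the pairwise ranking objective typically used for such a reward model (a simplified stand-in for illustration, not the exact InstructGPT code), where score_w and score_l are the model's scalar outputs for the preferred and the rejected response:

import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_w: torch.Tensor, score_l: torch.Tensor) -> torch.Tensor:
    # score_w / score_l: shape (batch,), scalar rewards for the preferred / rejected responses.
    # Minimizing this pushes the reward model to score the preferred response higher.
    return -F.logsigmoid(score_w - score_l).mean()

loss = pairwise_reward_loss(torch.randn(8), torch.randn(8))   # toy usage with random scores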

Step 3: Optimize a policy using the reward model with PPO

At this point we have got everything ready to finetune the model with RLHF: the initial policy and the reward model. The training in this step follows a typical RL process: in each episode, a new prompt is sampled (the "state") and new outputs will be generated (the model’s "action") by the current policy (the "agent"), and then the reward model will calculate a reward for the output ("reward"), according to which the policy will be updated using PPO.

Don't worry if you are not familiar with PPO – it is simply an RL algorithm designed to keep each policy update small and stable, so the agent improves its strategy gradually.

A few things to mention here:

  • A per-token KL penalty (against the supervised baseline) is added at each token to mitigate over-optimization of the reward model; a sketch of this reward shaping follows after this list.
  • The authors further experimented with mixing the pretraining gradients into the PPO gradients in order to fix the performance regressions on public NLP datasets (such regressions are often called "the alignment tax"); this variant is referred to as "PPO-ptx". In this paper, InstructGPT actually refers to the PPO-ptx models.
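Here is a rough sketch of how such a per-token reward signal could be assembled (an illustrative simplification with made-up variable names, not the actual implementation):

import torch

def rlhf_token_rewards(logp_policy, logp_ref, reward_model_score, beta=0.02):
    # logp_policy / logp_ref: (seq_len,) log-probs of the sampled response tokens under
    # the current policy and the frozen supervised (SFT) baseline, respectively.
    # reward_model_score: scalar score from the reward model for the whole response.
    kl_per_token = logp_policy - logp_ref           # sample-based estimate of the KL term
    rewards = -beta * kl_per_token                  # penalize drifting too far from the baseline
    rewards[-1] = rewards[-1] + reward_model_score  # add the reward model score at the final token
    return rewards

rewards = rlhf_token_rewards(torch.randn(16), torch.randn(16), reward_model_score=1.3)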

Note that Step 2 and Step 3 can be iterated continuously:

  • With an updated policy (from Step 3), we can generate new outputs and collect more comparison data, which can be used to train a new reward model by repeating Step 2;
  • With a new reward model (from Step 2), we can get a better policy by repeating Step 3.

Findings in Evaluation

Due to space limitations we will not go through all the evaluation results in this article; instead, we will just highlight several new findings.

As perhaps the most important finding, results show that RLHF can indeed improve alignment. The figure below shows the win rate against the supervised 175B GPT3 model, evaluated by human judges. According to this figure, both PPO and PPO-ptx significantly outperform the GPT baselines, where even the 1.3B PPO models are better than the 175B GPT-3. This result clearly demonstrates the effectiveness of RLHF.

Figure 9. Human evaluation results. (Image from InstructGPT paper.)

The authors also found that InstructGPT shows improvements in truthfulness (hallucination rate reduced from 41% to 21%) and slight improvements in toxicity (25% fewer toxic outputs), but no significant improvement in reducing bias.

Another finding is that PPO-ptx can minimize performance regressions on public NLP datasets, as shown in the figure below.

Figure 10. Few-shot performance on public NLP datasets. (Image from InstructGPT paper.)

Summary

Training an LLM usually involves multiple stages like pre-training, supervised finetuning, and alignment with RLHF. For the task at hand, we can usually start from an open-source, pre-trained LLM and finetune it on domain-specific data.

A few questions to ask while finetuning your own LLMs (though this is not meant to be an exhaustive list):

  • Do we have a clear definition of the model's desired behaviors? How can we evaluate such behaviors? If no suitable metrics are available, can we create one ourselves?
  • Do we have available training data? If not, how can we collect such data by ourselves? If human labelers are needed, how to ensure their labeling quality?
  • What kind of cleaning or pre-processing is needed? Are there any heuristics we can use to check the data quality?
  • Does our data cover a wide range of scenarios?
  • Do we need to modify our tokenizers? Do we need to modify the model structures? Do we need to add auxiliary finetuning objectives?
  • Does finetuning lead to regression on pre-training performance? Can we seek a balance?
  • Does finetuning lead to some unexpected negative behaviors? How can we mitigate that?
  • How to prevent overfitting in the finetuning process?
  • What hyper-parameters can we tune during finetuning or during evaluation? Are there any heuristics we can leverage?

At the end of the day, exploring a new task is always both challenging and exciting, and I hope the learnings from this article can help make it less challenging, more exciting, and ultimately more enjoyable 🙂

Thanks for reading!

The post Understanding the Evolution of ChatGPT: Part 3- Insights from Codex and InstructGPT appeared first on Towards Data Science.

Influential Time-Series Forecasting Papers of 2023-2024: Part 1 https://towardsdatascience.com/influential-time-series-forecasting-papers-of-2023-2024-part-1-1b3d2e10a5b3/ Fri, 17 Jan 2025 12:02:18 +0000 https://towardsdatascience.com/influential-time-series-forecasting-papers-of-2023-2024-part-1-1b3d2e10a5b3/ Exploring the latest advancements in time series

The post Influential Time-Series Forecasting Papers of 2023-2024: Part 1 appeared first on Towards Data Science.

Influential Time-Series Forecasting Papers of 2023–2024: Part 1
Created with DALL·E 3

Let’s kick off 2025 with a roundup of notable time-series forecasting papers.

I've also included some key papers from 2023. 2024 was the year of foundation forecasting models – but I'll cover those in a separate article.

The papers I’ll cover are:

  1. Deep Learning-Based Forecasting: A Case Study From the Online Fashion Industry
  2. Forecasting Reconciliation: A Review
  3. TSMixer: An All-MLP Architecture for Time Series Forecasting
  4. CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting
  5. Long-Range Transformers for Dynamic Spatiotemporal Forecasting

Let’s get started!

✅ I’ve launched AI Horizon Forecast, a newsletter focusing on time-series and innovative AI research. Subscribe here to broaden your horizons!

Deep Learning-Based Forecasting: A Case Study From the Online Fashion Industry

This paper is a gem.

It made headlines initially because Zalando, a leading online fashion retailer, used a custom Transformer model to outperform SOTA forecasting models – including LGBM.

But it’s more than that – this paper offers depth and is authored by brilliant researchers in the time series/ML ecosystem.

Paper Insights Overview

The paper describes Zalando’s in-depth retail forecasting pipeline and offers valuable insights:

  • Challenges unique to online retail fashion forecasting.
  • How to handle sparsity, cold starts, and short history (Figure 1).
  • What external covariates were used and how the Transformer is built to leverage them?
  • How elegant tricks, with interesting explanations, improved the vanilla Transformer and tailored it to time-series forecasting.
  • Extensive details on the custom loss function, evaluation metrics, hyperparameters, and training configurations – rarely disclosed in many papers.
Figure 1: Three edge cases in Zalando’s dataset: Cold start, short history, and sparsity (Source [1])

The major contributions/key points of this paper are:

  • Covariate Handling: Using discount as a dynamic covariate.
  • Causal Relationship: Enforcing a monotonic relationship between discount and demand via a piecewise linear function parameterized by feedforward networks.
  • Two forecasting modes: Training and predicting short- and long-term demand with a single global model.
  • Scaling laws: Demonstrating the first sparks of scaling laws in Transformer forecasting models.

First, the authors explain how they configured the forecasting problem by translating the sales observations into demand forecasting. Demand planning is the most common case in retail forecasting.

Also, the authors cleverly impute demand when it cannot be observed (e.g., during a stock shortage) instead of marking it as missing.

Note: Demand Forecasting is the first and most important step in a supply forecasting pipeline

Demand represents what customers want to buy, while sales are restricted by stock availability. Forecasting demand is crucial, as it reflects true customer needs, unlike sales, which are supply-dependent.

If sales drop to zero due to stock shortages, future forecasts based on sales will also reflect zero. This can mislead automated supply planning systems, leading to no new orders.

To ensure an efficient supply chain, always forecast demand instead of sales. Accurate demand forecasts prevent disruptions and maintain stock to meet customer needs.

Zalando’s Forecasting Pipeline

The paper categorizes features into 4 types:

  • static-global: time-independent & market-independent
  • dynamic-global: time-dependent & market-independent
  • dynamic-international: time-dependent & market-specific
  • static-international: time-independent & market-specific
Table 1: Covariate types in Zalando’s forecasting pipeline (Source [1])

While all covariates enhance performance, discount is the most influential. Also, data has a weekly frequency.

The Zalando team trained a global Transformer model with:

  • Input: A single tensor in ℝ^(t×u×v), where t is the time dimension, u the batch size and v the dimension of the covariate vector.
  • Output: the near future covers the next 5 weeks and the far future covers weeks 5 to 20.

The authors input details for different products, markets, and discount levels – and the output of the forecasting engine is a tensor in ℝ^(a×c×t×d) that spans t = 26 future weeks, up to a = 1 × 10⁶ products, c = 14 countries and d = 15 discount levels.

Figure 2 describes the top-level view of the Transformer model:

Figure 2: Top-level view of custom Transformer architecture (Source[1])

The model uses an encoder-decoder structure with time-series-specific adaptations:

  • Encoder: Processes past observed covariates.
  • Decoder: Handles future known inputs, generating short and long-term forecasts.
  • Non-autoregressive Decoder: Produces forecasts in a multi-step manner.
  • Positional Embeddings: They are not added to tokens as in the traditional Transformer – but as an extra covariate.

The key innovation was natively embedding the idea of a monotonic increase in demand with higher discounts into the model.

This makes sense since the 2 values are positively correlated – but if this relationship is not imposed in the model, it won’t necessarily occur all the time.

Future demand for a given discount is modeled via a piecewise linear, monotonic demand response function. This function is parameterized by 2 FFNs in the Monotonic Demand Layer. A softplus transformation ensures the demand function is monotonically increasing w.r.t. discount (a small sketch follows the figures below):

Figure 3: Softplus activation function (Source)
Figure 4: Demand as a piecewise linear monotonic demand response function (Source[1])
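The sketch below shows one way such a constraint can be implemented (my own simplified reading of the idea, with invented layer sizes – not Zalando's code): segment slopes produced by a feedforward layer are passed through a softplus, so each linear piece can only increase demand as the discount grows.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicDemandHead(nn.Module):
    # Demand as a piecewise-linear function of discount that is monotonically
    # increasing, because every segment slope is forced positive via softplus.
    def __init__(self, hidden_dim: int, num_pieces: int = 4):
        super().__init__()
        self.base = nn.Linear(hidden_dim, 1)             # demand at zero discount
        self.raw_slopes = nn.Linear(hidden_dim, num_pieces)
        knots = torch.linspace(0.0, 1.0, num_pieces + 1)
        self.register_buffer("knots", knots[:-1])        # left edge of each segment
        self.register_buffer("widths", knots[1:] - knots[:-1])

    def forward(self, h: torch.Tensor, discount: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) encoder output; discount: (batch, 1) in [0, 1].
        slopes = F.softplus(self.raw_slopes(h))                        # (batch, num_pieces) >= 0
        seg = (discount - self.knots).clamp(min=0.0).minimum(self.widths)
        return self.base(h) + (slopes * seg).sum(dim=-1, keepdim=True)

head = MonotonicDemandHead(hidden_dim=32)                              # toy usage
demand = head(torch.randn(5, 32), torch.rand(5, 1))                    # shape (5, 1)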

Finally, an important aspect of this paper was the first spark of scaling laws in forecasting. It was the first time a paper showed on a private, large, and diverse dataset that a Transformer-based forecasting model leveraged scale to deliver better performance:

Figure 5: Log-log plot of Zalando’s Transformer forecasting model vs simpler model, in terms of performance against training size (Source [1])

This work established a foundation for large-scale, time-series forecasting models.

In general, it’s a well-written paper with interesting details that we didn’t cover here – but we’ll discuss them in a future article. Stay tuned!

Forecasting Reconciliation: A Review

Hierarchical forecasting and reconciliation are among the most active areas in time-series research.

It all started with a research team at Monash University (which included the distinguished professor Rob Hyndman) publishing the iconic paper [Hyndman et al., 2011]. Of course, work on this area had started earlier.

Hierarchical forecasting reached the broader machine-learning community through the M5 forecasting competition.

The paper [2], written by researchers from the same school of thought, is the best resource for tracking the latest forecasting reconciliation research.

Preliminaries on Forecasting Reconciliation

Consider a hierarchical time series with multiple levels, such as product and store. Your task is to forecast for all levels.

Figure 6: An overview of how the M5 time series levels are organized (Source)
  • Naive approach: We separately forecast sales for each product, store, and the total. However, the top-level forecasts won’t align with the sum of the lower levels, making them incoherent.
  • Bottom-up approach: Here, forecasts start at the bottom level and aggregate upwards. This avoids information loss but isn’t optimal. Lower-level series are often noisy or sparse, causing errors to accumulate as you aggregate.

Other approaches, like top-down and middle-out, have similar challenges.

The best way is to establish a mathematical reconciliation method:

Forecasting reconciliation is not a time-series model, but a process that ensures consistency across different levels of forecasts. Given base forecasts across all levels, this process optimally makes base forecasts coherent.

✅ Note: Some models generate coherent forecasts directly (see [2]).

Why Use Forecasting Reconciliation?

Tutorials often present forecasting reconciliation as a way to ensure forecasts make sense, avoiding questions from a manager like why the sum of product A and B sales doesn’t match the total estimate.

Ok, this is also true. But that’s not all.

Reconciliation often improves forecast accuracy in hierarchical datasets. Here’s why:

  • You can use the best model for each level.
  • For example, models at different levels can leverage distinct loss functions (e.g., Tweedie loss for non-negative, right-skewed lower-level data; Huber loss at the top levels for outlier-prone top levels).

Once you’ve generated the base optimal forecasts at each level, you can select an appropriate reconciliation method from [2]. And just like that – boom! Your forecasts will not only be coherent but also likely more accurate overall!

Some Notation

Let’s break down hierarchical forecasting.

Suppose we have 3 stores, each selling 2 products. This setup gives us 10 time series in total, with 6 at the bottom level.

For instance, yA,t represents the sales of store A at time t, and yAA,t represents the sales of product A in store A. The top-level series aggregates total sales across all stores.

Figure 7: A 2-level hierarchical tree diagram. The bottom level contains 6 time series, and we have 10 time series in total (Source [2])

The equation describing this hierarchy is:

yt = S bt

The vector yt stacks the observations of all 10 series at time t, S is the summing matrix that aggregates the bottom-level observations, and bt is the vector of bottom-level observations/sales at time t.

Let's zoom into the values of the above equation: the top four rows of S aggregate the bottom-level series into the overall total and the three store totals, while the bottom six rows form an identity matrix that simply returns the bottom-level series themselves (the numpy sketch further below makes this explicit).

Hence, the summing matrix S determines the hierarchical structure.

Bottom-Up Approach

Reconciliation aims to adjust incoherent base forecasts to make them coherent and as accurate as possible.

We achieve this using a projection (or reconciliation) matrix G, so that the reconciled forecasts are obtained as:

y_tilde_h = S G yhat_h

where:

  • y_tilde: reconciled forecasts at all levels
  • S: summing matrix
  • G: projection matrix
  • yhat: base forecasts at all levels
  • h: horizon

and in matrix format, the bottom-up projection matrix is

G = [ 0 (6×4) | I (6×6) ]

Notice that in G, the first 4 columns zero out the base forecasts of the aggregate levels (the total and the three stores), while the remaining columns simply pick the base forecasts of the bottom level (see the numpy sketch below).
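To make the notation concrete, here is a small numpy sketch for the toy hierarchy above (my own example, not code from the review):

import numpy as np

# Summing matrix S for: total, stores A/B/C, and the 6 bottom-level (store, product) series.
S = np.vstack([
    np.ones((1, 6)),                        # total = sum of all bottom-level series
    np.kron(np.eye(3), np.ones((1, 2))),    # each store = sum of its two products
    np.eye(6),                              # the bottom-level series themselves
])                                          # shape (10, 6)

# Bottom-up projection: drop the 4 aggregate base forecasts, keep the 6 bottom-level ones.
G = np.hstack([np.zeros((6, 4)), np.eye(6)])                 # shape (6, 10)

y_hat = np.random.default_rng(0).uniform(10, 20, size=10)    # incoherent base forecasts
y_tilde = S @ G @ y_hat                                       # coherent, bottom-up reconciled forecasts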

Advanced Reconciliation Methods

The bottom-up approach is straightforward – it aggregates forecasts to ensure coherence.

However, it’s not a native reconciliation method since forecasts are made coherent simply by aggregation.

That’s why the bottom-up approach has a simple projection matrix G, but in practice, we can devise a reconciliation method by solving for the matrix G.

The goal is to find the optimal G matrix according to equation 6 that gives the most accurate reconciled forecasts.

The paper starts with the simple OLS equation for estimating G:

G = (S'S)^-1 S'

It continues with the milestone paper by Wickramasuriya et al. (2019), which introduces the MinTrace (MinT) method that minimizes the sum of variances of the reconciled forecast errors.

The matrix G is then computed as:

G = (S' W_h^(-1) S)^(-1) S' W_h^(-1)

where W_h is the variance-covariance matrix of the base forecast errors.

We won't go into further detail here; feel free to read the original paper [2].

Key Challenges and Tools

The authors discuss additional reconciliation approaches and how these can be enhanced with probabilistic forecasting or accelerated using sparse methods. However, reconciliation methods can be memory-intensive, especially with datasets containing hundreds of time series.

They also cover domain-specific applications of hierarchical forecasting and review software tools and libraries for implementation.

The paper is detailed and well-written but assumes some familiarity with the field. For beginners, I recommend Rob Hyndman’s and George Athanasopoulos’s Forecasting: Principles and Practice book which has a chapter dedicated to hierarchical forecasting.

TSMixer: An All-MLP Architecture for Time Series Forecasting

Boosted trees excel at forecasting tabular-like time series data with multiple covariates.

However, deep learning models have struggled to leverage their architectures effectively on complex datasets due to overfitting issues.

Google researchers addressed this by developing TSMixer, a lightweight MLP-based model, and its enhanced variant, TSMixer-Ext, which accommodates exogenous and future-known variables. TSMixer-Ext delivered remarkable performance on Walmart's M5 dataset, surpassing established SOTA models.

TSMixer stands out for its double-mixing operation, utilizing cross-variate information across both the time domain (time-mixing layer) and feature domain (feature-mixing layer).

Figure 8: Top-level view of TSMixer, featuring Time-Mixing and Feature-Mixing MLPs arranged in N blocks. (Source [3])

We covered TSMixer and TSMixer-Ext in detail in the article here – check it out for more insights!

Find a hands-on tutorial on TSMixer in the AI Projects folder (Project 8)

Figure 9: TSMixer’s rolling-forward forecasts on the ETTm2 dataset with prediction length = 48 (Image by author, Source)

CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting (ICLR 2024)

The release of the DLinear and TSMixer papers exposed weaknesses in Transformer-based forecasting models.

In multivariate setups (where models learn cross-variate dependencies across all time series and predict them jointly), Transformers often overfit.

The quadratic self-attention mechanism tended to capture noise, limiting performance – especially on smaller or toy datasets. As a result, newer attention-based approaches shifted to univariate training, yielding impressive results.

CARD revisits the multivariate scenario, aiming for SOTA results by redesigning attention mechanisms for time series.

Dual Channel Attention

Earlier, we saw how TSMixer uses a double MLP mixing operation across time and feature dimensions. Similarly, CARD applies attention across both token (time) and channel (feature) dimensions:

Figure 10: Top-level view of CARD’s architecture (Source [4])

Figure 10 is very similar to TSMixer’s architecture in Figure 8. The authors sparsify attention further with summary tokens, exponential smoothing, and extra projections. These additions reduce complexity (Figure 11), helping the model distill meaningful information.

Figure 11: Operations inside CARD’s attention block (Source [4])

Note: CARD partitions signals into patches, meaning tokenization occurs at the patch level.

Patching benefits attention-based models because point-wise tokenization lacks semantic richness, while patches better capture a sequence’s characteristics.

Token Blend Module

In Transformer forecasting, each attention head focuses on specific aspects of the data or feature patterns in the input sequence. For example:

  • One head might focus on short-term dependencies (e.g., relationships between nearby time steps).
  • Another might capture periodic behavior (e.g., weekly or seasonal cycles).

Thus, within a single attention head:

  • Tokens encode related information about the input sequence, filtered through the head’s particular "lens" or learned feature pattern.
  • Tokens adjacent to each other correspond to neighboring time steps in the sequence and share temporal proximity.

In vanilla attention, tokens merge across heads. The proposed token blend module (Figure 12) modifies this by merging adjacent tokens within the same head after multi-head attention, creating tokens for the next stage:

Figure 12: Example of token blend block in CARD (Source [4])

Since each attention head specializes in specific features (e.g., trends or periodic patterns), merging tokens within the same head retains this focus while expanding the temporal scope.

Signal Decay-Based Loss Function

The authors note that errors are higher at the far future end of the horizon. Near-future predictions contribute more to generalization improvements than far-future ones.

To address this, they use a decay-based loss function – a MAE-based variant that weights near-future predictions more:

where:

  • a_t(A) are the observations at time t, given historical information A
  • ahat_t are the predictions at time t, given historical information A
  • L is the sequence length

Hence, this loss emphasizes near-future accuracy more.
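A minimal sketch of such a loss, assuming a simple 1/sqrt(t) weighting over the horizon (the exact weighting used in the paper may differ):

import torch

def decay_weighted_mae(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (batch, horizon). Steps further into the future receive smaller
    # weights, so near-future errors dominate the loss.
    horizon = pred.shape[-1]
    t = torch.arange(1, horizon + 1, dtype=pred.dtype, device=pred.device)
    return ((pred - target).abs() * t.rsqrt()).mean()

loss = decay_weighted_mae(torch.randn(4, 24), torch.randn(4, 24))   # toy usage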

In general, CARD is an impressive model. Unfortunately, it hasn’t been integrated into any known forecasting library, but the authors have open-sourced it. I’ll definitely create a tutorial on this model in future articles.

Spacetimeformer

The paper "Long-Range Transformers for Dynamic Spatiotemporal Forecasting[5]", introduced a new model called Spacetimeformer.

What makes Spacetimeformer advantageous?

There are 3 key categories of models in general:

  • Temporal Models: Traditional attention-based time series models represent multiple variables per timestep in a single token, overlooking spatial relationships between features (Figure 13b).
  • Graph Models: Graph attention models manually encode feature relationships using static, hardcoded graphs, which cannot adapt over time (Figure 13c).
  • Spatiotemporal Models: Spacetimeformer integrates temporal and spatial attention, representing each feature’s value at a specific timestep as an individual token. This approach captures complex interactions across space, time, and value (Figure 13d).
Figure 13: Attention in Multivariate Forecasting. (a) A sequence with 3 variables, 2 context points, and 2 target points to predict. (b) Temporal attention, where each token encompasses all three variables, with darker blue lines indicating stronger attention between tokens. (c) Temporal attention with spatial interactions, where spatial relationships within each timestep are modeled using predefined spatial graphs (black lines). (d) Spatiotemporal attention, where each variable at each timestep is treated as an individual token.

Previous Transformer-based models used one input token per timestep so that the embedding of the token at time 𝑡 represents 𝑁 distinct variables at that moment in time.

This is problematic because:

  • Lagged Information: In practice, dependencies can be lagged, such as y depending on xi-1 rather than xi. Time-based embeddings may overlook these lags.
  • Limited Receptive Field: Unlike NLP, where vocabularies are discrete, time-series data are continuous. Tokenizing by timepoints restricts receptive fields, limiting the capture of long-term, cross-temporal correlations.

Spacetimeformer’s Solution

Spacetimeformer addresses these limitations with the architecture shown in Figure 13d. For 𝑁 variables and a sequence length of 𝐿, we have:

Multivariate inputs of shape (𝐿, 𝑁) are flattened into sequences of shape (𝐿 × 𝑁, 1), where each token represents a single variable's value at a specific timestep. This transformation enables the model to jointly learn attention across both space and time, creating the "spatiotemporal attention" mechanism illustrated in Figure 13d; a tiny sketch of the flattening step is shown below.
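The following sketch only illustrates the reshaping and the per-token indices (illustrative shapes, not the official implementation):

import torch

B, L, N = 8, 12, 3                      # batch size, context length, number of variables
x = torch.randn(B, L, N)                # standard multivariate input: one token per timestep

# Flatten so that every (variable, timestep) pair becomes its own token.
values = x.permute(0, 2, 1).reshape(B, L * N, 1)                  # (batch, L*N, 1)
var_ids = torch.arange(N).repeat_interleave(L).expand(B, -1)      # which variable each token belongs to
time_ids = torch.arange(L).repeat(N).expand(B, -1)                # which timestep each token came from

# var_ids would feed a lookup-table "variable" embedding and time_ids a Time2Vec-style time
# embedding; both are combined with a linear projection of `values` to form the final tokens.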

The embedding differences between a temporal and a SpatioTemporal model are shown in Figure 14:

Figure 14: The Spatiotemporal model flattens the N variables into a single sequence, and uses different types of embeddings to differentiate them (Source [5])

Figure 14 shows in detail the embedding types used by SpaceTimeFormer:

  • Given Embeddings: Multivariate inputs include time information, with missing values ("?") set to zero for predictions. These embeddings indicate whether values are provided as context or require prediction
  • Time Embeddings: Time sequences are passed through a Time2Vec layer to create time embeddings that capture periodic patterns.
  • Variable Embeddings: Each time series covariate is mapped to a spatial representation using a lookup-table embedding.
  • Value & Time Embeddings: Time2Vec embeddings and variable values are projected using a feed-forward layer.
  • Position Embeddings: Typical learned position embeddings used in other models like BERT (not shown in Figure 15)
Figure 15: The different types of embeddings in SpaceTimeFormer and how they are handled (Source [5])

All embeddings are summed to model relationships across temporal and variable spaces, but this increases input sequence length.

Given self-attention’s quadratic complexity, scaling becomes challenging for large N values. To address this, the following optimizations are applied:

  • Performer’s FAVOR+ Attention: A linear approximation of attention using a random kernel method.
  • Global and Local Attention: Local attention allows each token to focus on its spatial variable’s timesteps, while global attention lets tokens attend across the entire spatiotemporal sequence. This reduces computations without compromising temporal dynamics.

Figure 16 illustrates SpaceTimeFormer’s encoder-decoder architecture:

Figure 16: SpaceTimeFormer top-level architecture (Source [5])

Benchmark results show promising performance, with the model outperforming other notable implementations. Newer models like iTransformer and CARD, which we saw earlier, have since adopted similar embedding mechanisms across the time dimension.

Finally, attention weights can be extracted and visualized to reveal spatiotemporal relationships among variables:

Figure 17: Forecasting the temperature at 3 weather stations in Texas (lower left, blue diamonds) and 3 stations in New York (upper right, purple triangles) (Source 5)

You can find the official repo of Spacetimeformer here. This model has also been applied to other use cases such as financial forecasting.

References

[1] Kunz et al., Deep Learning based Forecasting: A Case Study from the Online Fashion Industry

[2] Athanasopoulos et al., Forecast Reconciliation: A Review

[3] Chen et al., TSMixer: An All-MLP Architecture for Time Series Forecasting

[4] Xue et al., CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting (ICLR 2024)

[5] Grigsby et al., Long-Range Transformers for Dynamic Spatiotemporal Forecasting

Thank you for reading!

The post Influential Time-Series Forecasting Papers of 2023-2024: Part 1 appeared first on Towards Data Science.

Why Data Scientists Can’t Afford Too Many Dimensions and What They Can Do About It https://towardsdatascience.com/why-data-scientists-cant-afford-too-many-dimensions-and-what-they-can-do-about-it-653230d50f9c/ Thu, 16 Jan 2025 13:32:00 +0000 https://towardsdatascience.com/why-data-scientists-cant-afford-too-many-dimensions-and-what-they-can-do-about-it-653230d50f9c/ An in-depth article about dimensionality reduction and its most popular methods

The post Why Data Scientists Can’t Afford Too Many Dimensions and What They Can Do About It appeared first on Towards Data Science.

Photo by Paulina Gasteiger on Unsplash

Dimensionality reduction is a central method in the field of data analysis and machine learning that makes it possible to reduce the number of dimensions in a data set while retaining as much of the information it contains as possible. This step is often applied before training a model to save computing power and avoid the problem of overfitting.

In this article, we take a detailed look at dimensionality reduction and its objectives. We also illustrate the most commonly used methods and highlight the challenges of dimensionality reduction.

What is Dimensionality Reduction?

Dimensionality reduction comprises various methods that aim to reduce the number of characteristics and variables in a data set while preserving the information in it. In other words, fewer dimensions should enable a simplified representation of the data without losing patterns and structures within the data. This can significantly accelerate downstream analyses and also optimize machine learning models.

In many applications, problems occur due to the high number of variables in a data set, which is also referred to as the curse of dimensionality. For example, too much dimensionality can lead to these problems:

  • Overfitting: When machine learning models become overly reliant on characteristics of the training dataset and therefore only provide poor results for new, unseen data, this is known as overfitting. With a high number of features, the risk of overfitting increases as the model becomes more complex and therefore adapts too strongly to the errors in the data set.
  • Computational complexity: Analyses and models that have to process many variables often require more parameters to be trained. This increases the computational complexity, which is reflected either in a longer training time or in increased resource consumption.
  • Data Noise: With an increased number of variables, the probability of erroneous data or so-called noise also increases. This can influence the analyses and lead to incorrect predictions.

Although large data sets with many characteristics are very informative and valuable, the high number of dimensions can also quickly lead to problems. Dimensionality reduction is a method that attempts to preserve the information content of the data set while reducing the number of dimensions.

What is the Curse of Dimensionality?

The Curse of Dimensionality occurs with high-dimensional data sets, i.e. those that have a large number of attributes or features. At first, many attributes are a good thing because they contain a lot of information and describe the data points well. For example, if we have a dataset about people, the attributes can be information such as hair color, height, weight, eye color, etc.

In mathematical terms, however, each additional attribute means a new dimension in the space and therefore a significant increase in possibilities. This becomes clear from the following example, in which we want to find out which customers buy which products. In the first step, we only look at the age of the prospective customers and whether they have bought the product. We can still depict this relatively easily in a two-dimensional diagram.

Diagram of Buyers divided by Age | Source: Author

As soon as we add more information about the customer, things get a little more complex. The information on the customer’s income would mean a new axis on which the numerical income is mapped. So the two-dimensional diagram becomes a three-dimensional one. The additional attribute "gender" would lead to a fourth dimension and so on.

When working with data, it is desirable to have a lot of attributes and information in the data set to give the model many opportunities to recognize structures in the data. However, it can also lead to serious problems, as the name Curse of Dimensionality suggests.

Data Sparsity

The example shown illustrates a problem that occurs with many attributes. Due to the large number of dimensions, the so-called data space, i.e. the number of values a dataset can take on, also grows. This can lead to what is known as data sparsity, meaning that the training data set used to train the model does not contain certain values at all, or only very rarely. As a result, the model only delivers poor results for these marginal cases.

Let's assume that we examine 1,000 customers in our example, as it would be too time-consuming to survey even more customers, or this data is simply not available. All age groups from young to old may be well represented among these customers. However, if the additional dimension of income is added, it becomes less likely that every possible combination of characteristics, such as "young" and "high income" or "old" and "medium income", will occur and be backed up with enough data points.

Distance Concentration

If you want to evaluate the similarity of different data points in the field of machine learning, distance functions are often used for this. The most common clustering algorithms, such as k-means clustering, rely on calculating the distances between points and assigning each point to a cluster based on those distances. In multidimensional spaces, however, it can quickly happen that all points are at a similar distance from each other, so separating them seems almost impossible.

We are also familiar with this phenomenon from everyday life. If you take a photo of two objects, such as two trees, they can look very close to each other in the picture, as it is only a two-dimensional image. In real life, however, the trees may be several meters apart, which only becomes clear in three dimensions.

All these problems, which can occur in connection with many dimensions, are summarized under the term Curse of Dimensionality.

What are the Goals of Dimensionality Reduction?

Dimensionality reduction pursues three primary goals: improving model performance, visualizing data, and increasing processing speed. We will examine these in more detail in the following section.

Improving Model Performance

One of the main goals of dimensionality reduction is to improve model performance. By reducing the number of variables in a dataset, a less complex model can be used, which in turn reduces the risk of overfitting.

Models that have a large number of parameters and are therefore highly complex tend to overfit the training dataset and the noise in the data. As a result, the model delivers poorer results with new data that does not contain this noise, while the accuracy of the training data set is very good. This phenomenon is known as overfitting. During dimensionality reduction, unimportant or redundant features are removed from the data set, which reduces the risk of overfitting. As a result, the model delivers better quality for new, unseen data.

Visualization of Data

If you want to visualize data sets with many features, you face the challenge of mapping all this information into a two- or at most three-dimensional space. The easiest visualization assigns a separate axis to each feature in the data set, but any dimensionality beyond three is no longer directly tangible for us humans. Therefore, with high-dimensional data sets, we are often faced with the problem that we cannot simply visualize the data to gain an initial understanding of its peculiarities and, for example, to recognize whether there are outliers.

Dimensionality reduction helps to reduce the number of dimensions to such an extent that visualization in two- or three-dimensional space is possible. This makes it easier to better understand the relationships between the variables and the data structures.

Increasing the Processing Speed

Computing time and the necessary resources play a major role in the implementation of projects, especially for machine learning and Deep Learning algorithms. Often, only limited resources are available, which should be used optimally. By removing redundant features from the data set at an early stage, you not only save time and computing power during data preparation but also when training the model, without having to accept lower performance.

In addition, dimensionality reduction makes it possible to use simpler models that not only require less power during initial training but can also perform calculations faster later during operation. This is an important factor, especially for real-time calculations.

Overall, Dimensionality Reduction is an important method for improving data analysis and building more robust machine-learning models. It is also an important step in the visualization of data.

Which Methods are used for Dimensionality Reduction?

In practice, various methods for dimensionality reduction have become established, three of which are explained in more detail below. Depending on the application and the structure of the data, these methods already cover a broad spectrum and can be used for most practical problems.

Principal Component Analysis

Principal component analysis (PCA) assumes that several variables in a data set possibly measure the same thing, i.e. are correlated. These different dimensions can be mathematically combined into so-called principal components without compromising the significance of the data set. The shoe size and height of a person, for example, are often correlated and can therefore be replaced by a common dimension to reduce the number of input variables.

Principal component analysis describes a method for mathematically calculating these components. The following two key concepts are central to this:

The covariance matrix is a matrix that specifies the pairwise covariances between two different dimensions of the data space. It is a square matrix, i.e. it has as many rows as columns. For any two dimensions X and Y, the covariance is calculated as follows:

COV(X, Y) = 1/n · Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ)

Here n stands for the number of data points in the data set, Xᵢ is the value of the dimension X of the i-th data point and X̄ is the mean value of the dimension X for all n data points. As can be seen from the formula, the covariance between two dimensions does not depend on the order of the dimensions, so the following applies: COV(X, Y) = COV(Y, X). These values result in the following covariance matrix C for the two dimensions X and Y:

C = [[ COV(X, X)  COV(X, Y) ],
     [ COV(Y, X)  COV(Y, Y) ]]

The covariance of two identical dimensions is simply the variance of the dimension itself, i.e. COV(X, X) = VAR(X).

The covariance matrix is the first important step in the principal component analysis. Once this matrix has been created, the eigenvalues and eigenvectors can be calculated from it. Mathematically, the following equation is solved for the eigenvalues:

det(C − λI) = 0

Here λ is the desired eigenvalue and I is the identity matrix of the same size as the covariance matrix C. When this equation is solved, one or more eigenvalues of the matrix are obtained. They represent the linear transformation of the matrix in the direction of the associated eigenvector. An associated eigenvector can therefore also be calculated for each eigenvalue, for which the slightly modified equation must be solved:

(C − λI) · v = 0

Here v is the desired eigenvector, for which the equation must be solved. In the case of the covariance matrix, the eigenvalue corresponds to the variance of the data along the eigenvector, which in turn represents a principal component. Each eigenvector is therefore a mixture of different dimensions of the data set, and these mixtures are the principal components. The corresponding eigenvalue indicates how much variance of the data set is explained by the eigenvector. The higher this value, the more important the principal component is, as it contains a large proportion of the information in the data set.

Therefore, after calculating the eigenvalues, they are sorted by size and the eigenvalues with the highest values are selected. The corresponding eigenvectors are then calculated and used as principal components. This results in a dimension reduction, as only the principal components are used to train the model instead of the individual features of the data set.
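A minimal numpy sketch of these steps on toy data (for illustration only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # toy dataset: 200 samples, 5 features

X_centered = X - X.mean(axis=0)                # center every feature
C = np.cov(X_centered, rowvar=False)           # covariance matrix (5 x 5)
eigvals, eigvecs = np.linalg.eigh(C)           # eigen-decomposition of the symmetric matrix

order = np.argsort(eigvals)[::-1]              # sort components by explained variance
W = eigvecs[:, order[:2]]                      # keep the top 2 principal components
X_reduced = X_centered @ W                     # project the data onto 2 dimensions

explained_ratio = eigvals[order[:2]] / eigvals.sum()
print(X_reduced.shape, explained_ratio)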

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding, or t-SNE for short, approaches the problem of dimensionality reduction differently by attempting to create a new, lower-dimensional space that preserves the distances between data points from the higher-dimensional space as far as possible. The basic idea becomes clear in the following example.

It is not easy to transfer data sets from a high dimensionality to a low dimensionality while retaining as much information as possible from the data set. The following figure shows a simple, two-dimensional data set with a total of 50 data points. Three different clusters can be identified, which are also well separated from each other. The yellow cluster is furthest away from the other two clusters, while the purple and blue data points are closer to each other.

Two-dimensional data for t-SNE | Source: Author

The aim now is to convert this two-dimensional data set into a lower dimension, i.e. into one dimension. The simplest approach for this would be to represent the data either only by its X or Y coordinate.

Dimensionality reduction by X- or Y-coordinate | Source: Author

However, it is clear that this simple transformation has lost much of the information in the data set and gives a different picture than the original two-dimensional data. If only the X coordinates are used, it looks as if the yellow and purple clusters overlap and that all three clusters are roughly equidistant from each other. If, on the other hand, only the Y coordinates are used for dimensionality reduction, the yellow cluster is much better separated from the other clusters, but it looks as if the purple and blue clusters overlap.

The basic idea of t-SNE is that the distances from the high dimensionality are transferred to the low dimensionality as far as possible. To do this, it uses a stochastic approach and converts the distances between points into a probability that indicates how likely it is that two random points are next to each other.

More precisely, it is a conditional probability that indicates how likely it is that one point would choose the other point as a neighbor. Hence the name "Stochastic Neighbor Embedding".

Dimensionality reduction according to t-SNE | Source: Author

As you can see, this approach leads to a much better result, in which the three different clusters can be clearly distinguished from one another. It is also clear that the yellow data points are significantly further away from the other data points and the blue and purple clusters are somewhat closer together. To better understand the details of this approach, you are welcome to read our detailed article on this topic.
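
As a hedged illustration, the following sketch shows how such a reduction to one dimension could look with scikit-learn's TSNE class; the three Gaussian clusters are made up for the example and are not the data from the figures above:

import numpy as np
from sklearn.manifold import TSNE

# Three synthetic clusters in two dimensions (illustrative only)
rng = np.random.default_rng(42)
clusters = [rng.normal(loc=center, scale=0.5, size=(50, 2))
            for center in [(0, 0), (5, 1), (12, 8)]]
X = np.vstack(clusters)

# Reduce to one dimension; perplexity controls the effective neighborhood size
tsne = TSNE(n_components=1, perplexity=15, random_state=42)
X_1d = tsne.fit_transform(X)
print(X_1d.shape)  # (150, 1)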

Mastering t-SNE: A Comprehensive Guide to Understanding and Implementation in Python

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA for short) aims to identify and maximize the separability between classes by projecting the data into a lower-dimensional space. In contrast to other methods, it places particular emphasis on maximizing the separability between classes and is therefore particularly important for classification tasks.

Two central key figures are calculated:

  • Within-Class Scatter: This key figure measures the variability within a class of data. The lower this value, the more similar the data points within a class are, which in turn makes the classes easier to separate.
  • Between-class scatter: This scatter in turn measures the variability between the mean values of the classes. This value should be as large as possible, as this indicates that the classes are easier to separate from each other.

Simply put, these key figures are used to form matrices and calculate their eigenvalues. The eigenvectors for the largest eigenvalues are in turn the dimensions in the new feature space, which has fewer dimensions than the original data set. All data points can then be projected onto the eigenvectors, which reduces the dimensions.

Linear Discriminant Analysis is particularly suitable for applications in which the classes are already known and the data is also clearly labeled. It uses this class information from supervised learning to find a low-dimensional space that separates the classes as well as possible.

A disadvantage of LDA is that the maximum possible reduction is limited by the number of classes: a data set with n classes can be reduced to at most n - 1 dimensions. In concrete terms, this means, for example, that a data set with three different classes can be reduced to a two-dimensional space at most.

This approach to dimensionality reduction is also particularly suitable for large data sets, as the computing effort is only moderate and scales well with the amount of data.
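
As a small illustration, the sketch below uses scikit-learn's LinearDiscriminantAnalysis on the Iris data set, which has three classes and can therefore be reduced to at most two dimensions; the data set is chosen only for demonstration purposes:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris has 4 features and 3 classes, so LDA can reduce it to at most 3 - 1 = 2 dimensions
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)       # uses the class labels, unlike PCA or t-SNE

print(X_reduced.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)      # share of between-class variance per component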

What are the Challenges of Dimensionality Reduction?

Dimensionality reduction is an important step in data pre-processing to increase the generalizability of Machine Learning models and the general model performance, and possibly to save computing power. However, the methods also bring some challenges that should be considered before use. These include:

  • Loss of information: Although the methods presented attempt to retain as much variance, i.e. information content, of the data set as possible, some information is inevitably lost when dimensions are reduced. Important data can also be lost, which may lead to poorer analyses or model results. The resulting model may therefore be less accurate or less powerful.
  • Interpretability: Most algorithms used for dimensionality reduction provide a low-dimensional space that no longer contains the original dimensions, but a mathematical summary of them. Principal component analysis, for example, uses the eigenvectors, which are a summary of various dimensions from the data set. As a result, interpretability is lost to a certain extent, as the previous dimensions are easier to measure and visualize. Especially in use cases where transparency is important and the results require a practical interpretation, this can be a decisive disadvantage.
  • Scaling: Performing dimensionality reductions is computationally intensive and takes time, especially for large data sets. In real-time applications, however, this time is often not available, since it comes on top of the model's own computing time. Computationally intensive methods in particular, such as t-SNE or autoencoders, are quickly ruled out because they take too long to make live predictions.
  • Selecting the method: Depending on the application, the dimensionality reduction methods presented have their strengths and weaknesses. The selection of a suitable algorithm therefore plays an important role, as it has a significant influence on the results. There is no one-size-fits-all solution and a new decision must always be made based on the application.

Dimensionality reduction has many advantages and is an integral part of data pre-processing in many applications. However, the disadvantages and challenges mentioned must also be taken into account to train an efficient model.

This is what you should take with you

  • Dimensionality reduction is a method of reducing the number of dimensions, i.e. variables, in a data set while retaining as much information content as possible.
  • With many input variables, the so-called curse of dimensionality arises. This leads to problems such as data sparsity or distance concentration.
  • Dimensionality reduction often leads to a lower probability of overfitting, better visualization, and optimized computing power when training machine learning models.
  • The most common dimensionality reduction methods include principal component analysis, t-SNE, and LDA.
  • However, dimensionality reduction also has disadvantages, such as poorer interpretability or loss of information.

The post Why Data Scientists Can’t Afford Too Many Dimensions and What They Can Do About It appeared first on Towards Data Science.

]]>
Understanding Flash Attention: Writing the Algorithm from Scratch in Triton https://towardsdatascience.com/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton-5609f0b143ea/ Wed, 15 Jan 2025 17:01:59 +0000 https://towardsdatascience.com/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton-5609f0b143ea/ Find out how Flash Attention works. Afterward, we'll refine our understanding by writing a GPU kernel of the algorithm in Triton.

The post Understanding Flash Attention: Writing the Algorithm from Scratch in Triton appeared first on Towards Data Science.

]]>

Read for free at alexdremov.me

Flash Attention is a revolutionary technique that dramatically accelerates the attention mechanism in transformer-based models, delivering processing speeds many times faster than naive methods. By cleverly tiling data and minimizing memory transfers, it tackles the notorious GPU memory bottleneck that large language models often struggle with.

In this post, we’ll dive into how Flash Attention leverages efficient I/O-awareness to reduce overhead, then take it a step further by crafting a block-sparse attention kernel in Triton.

💥 I will provide a simple explanation of how Flash Attention works. We will then implement the explained algorithm in Triton!

What is Attention?

The attention mechanism (or scaled dot-product attention) is a core element of transformer models, the leading architecture for language modeling. All popular models, like GPT, LLaMA, and BERT, rely on attention.

The formula is pretty simple:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

The rest is history.

Even though the formula looks simple, its computation involves multiplications of large tensors and a lot of data movement. Considering that this is a core part of the transformer architecture, optimizing the algorithm greatly improves the performance of the model in general.

In the naive implementation, attention requires O(n²) additional memory and O(n²) compute time complexity, where n is the sequence length. That’s a lot!
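
To see where this quadratic cost comes from, here is a minimal PyTorch sketch of a naive implementation; the function name and the tensor sizes are illustrative only:

import torch
import torch.nn.functional as F

def naive_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention that materializes the full n x n score matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # shape (n, n): this is the O(n²) memory cost
    weights = F.softmax(scores, dim=-1)
    return weights @ V

n, d = 4096, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)  # torch.Size([4096, 64]); the intermediate scores tensor held 4096 x 4096 floats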

Flash Attention

Core Idea

The main idea of Flash attention can be summarized in a simple quote from the original paper:

We argue that a missing principle is making attention algorithms IO-aware – accounting for reads and writes between levels of GPU memory.

That is, modern GPUs have several types of memory:

  • SRAM – fast, on-chip, small
  • HBM – slower than SRAM, large size. That's what we usually refer to as GPU memory.

Check out the memory hierarchy in the image below to see the differences in bandwidth and sizes of different memory types.

Image from FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Tri Dao et al.

💡 To conduct computation, data must be transferred from HBM to SRAM, and this transfer is not overhead-free!

The Flash Attention algorithm proposes a method of computing attention in tiles, without explicitly materializing the attention scores tensor:

💥 Not materializing a matrix means that at any given time, the matrix does not exist in its full shape in memory.

It’s easy to see that this matrix requires O(n²) of memory to store. For large sequence lengths, that’s a lot of data! So, if we manage to avoid explicitly materializing this matrix, we can save lots of memory.

However, this matrix is necessary for transformer training, as it is part of backpropagation and gradient calculation. The authors propose that it is better to recalculate this matrix during the backward pass (again without explicit materialization). Not only does this save lots of memory, it also provides huge speedups, as we don't need to transfer this enormous matrix between different types of GPU memory.

Overall, such an approach not only sped up calculations by taking GPU I/O specifics into account, but also allowed processing huge sequence lengths, as the memory complexity drops to O(n).

Tiled Attention Calculation

The last thing to understand is how to compute attention in tiles. Basically, this means that we will calculate attention over the full sequence by processing incoming tokens in small portions.

Well, it’s easy to calculate QK^T in tiles. Since the attention head dimension is not large, we can load full rows of Q and full columns of K and conduct the multiplication in tiles.

😡 Yes, if we want to have an enormous attention dimension, Flash Attention will not work without algorithm modifications.

As dimensions are usually quite small even for enormous models, this limitation is fair.

Tiled QK^T | Image by the author

So, we have QK^T calculated in SRAM. All that’s left is to apply softmax, multiply by V, and that’s it!

That’s where the trick is.

The problem is that the softmax denominator requires aggregation over the sequence length to normalize scores, and we do not have access to the whole length as we load data in tiles.

To address it, we can implement a concatenated softmax algorithm. Using it, we can calculate softmax "in batch" mode: by adjusting computed values with the new incoming data.

Taking the algorithm from the original article, we can define rules to compute the softmax over a concatenation of data. Having two vectors x1 and x2, we need to calculate the softmax denominator l(x) over their concatenation x = [x1, x2]. With the running maximum m(x) = max(m(x1), m(x2)), the softmax denominator of the concatenation becomes:

l([x1, x2]) = e^(m(x1) − m(x)) · l(x1) + e^(m(x2) − m(x)) · l(x2)

The last equivalence can be easily verified, as

e^(m(x1) − m(x)) · l(x1) = e^(m(x1) − m(x)) · Σ_i e^(x1_i − m(x1)) = Σ_i e^(x1_i − m(x))

So, now we have what we want – we can calculate softmax per-tile and then, by doing re-normalization from the formula above, compute the global softmax. The last thing to do is to incorporate the tile of the V tensor and keep doing the same re-normalization (as matrix multiplication is a linear operation).

And all of this without loading the full sequence into memory or materializing QK^T!

💥 Notice that we calculate Softmax(QK^T) in tiles only, without needing to have the whole matrix at any moment.

Also, in the actual algorithm for numerical stability, we will compute not Softmax(x) but Softmax(x – max(x)). We can do that as softmax is invariant to constant shifts.
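
To illustrate the idea, here is a simplified PyTorch sketch of tiled attention with an online softmax. It is not the author's Triton kernel, just a plain-Python illustration of the re-normalization trick described above; names such as tiled_attention and tile_size are made up for the example:

import torch

def tiled_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, tile_size: int = 128) -> torch.Tensor:
    """Attention computed tile by tile with an online softmax, never materializing QK^T."""
    n, d = Q.shape
    out = torch.zeros_like(Q)                      # running (un-normalized) output
    row_max = torch.full((n, 1), float("-inf"))    # running maximum m(x) per query row
    row_sum = torch.zeros((n, 1))                  # running softmax denominator l(x)

    for start in range(0, n, tile_size):
        K_tile = K[start:start + tile_size]
        V_tile = V[start:start + tile_size]

        scores = Q @ K_tile.T / d**0.5             # only an (n, tile_size) block exists at a time
        tile_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, tile_max)

        correction = torch.exp(row_max - new_max)  # re-normalize previous tiles to the new maximum
        probs = torch.exp(scores - new_max)

        row_sum = row_sum * correction + probs.sum(dim=-1, keepdim=True)
        out = out * correction + probs @ V_tile
        row_max = new_max

    return out / row_sum                           # final softmax normalization

# Sanity check against the naive implementation
n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
reference = torch.softmax(Q @ K.T / d**0.5, dim=-1) @ V
print(torch.allclose(tiled_attention(Q, K, V), reference, atol=1e-4))  # expected: True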

Triton Implementation

Now, we can easily implement the outlined algorithm in Triton, which is a tool that allows us to write efficient GPU kernels with the ease of Python.

💡 To learn more about Triton, check out their official guides.

Tutorials – Triton documentation

Outlining the Algorithm

The first step is to decide how we will assign jobs and what data each job will load. By the algorithm of tiled softmax, each job must have access to K, V over the whole sequence length. So, each job will iterate over K, V in tiles. We don’t have any algorithmic restriction on the number of Q tiles processed. Therefore, each job will load just one Q tile and work with it only – this way we will maximize job parallelism.

Kernel jobs data management | Image by the author

In summary, each job will load a single Q tile, iterate over all tiles in K and V, and store one tile of result corresponding to the Q tile.

The Kernel

What’s left is to write the actual code. Let’s focus on the core part first, and only then we’ll add Triton-specific boilerplates.

Below is a Triton pseudocode with every line explained.

See? Easy!

What’s important is that you can see how simple it is to write such a kernel once you understand the idea of the tiled softmax. Apart from that, there’s nothing complicated from the algorithm perspective.

💥 This kernel can be made even faster by implementing Triton-specific optimizations. However, this is out of the scope of this article.

This pseudocode is pretty close to the actual code, which you can find in my GitHub by following the link. All I added is data management and PyTorch wrappers.

kernels/src/self_attention/kernel.py at main · alexdremov/kernels

❗Don’t hesitate to ask if something isn’t clear. I’m here in the comments 😁 .

The code above was extensively tested to match PyTorch’s scaled_dot_product_attention. You can also check out the tests to see how to use the written kernel.

Benchmarking

While we wrote the kernel in Triton to improve the algorithm understanding, it’s interesting to compare the performance with a naive implementation and PyTorch’s scaled_dot_product_attention.

Benchmarking implementations for different sequence lengths | Image by the author

As expected, the Flash Attention algorithm completely outperforms the naive implementation performance-wise. Also, I’ve marked with a dashed line the range of lengths for which the naive implementation causes a CUDA out-of-memory error.

We see that our Triton implementation is slightly worse than PyTorch SDPA, but the difference is not too large. Considering that PyTorch SDPA is a well-optimized CUDA kernel, that’s a nice result.

Benchmarking code is also available in the repository.

kernels/benchmark/benchmark_self_attention.py at main · alexdremov/kernels

This story was originally published on alexdremov.me. Check it out! (At the very least, the TeX looks better there.)

Conclusions

In the post, I covered the motivation of the Flash Attention algorithm as well as its algorithm details. Finally, we were able to implement it from scratch in Triton, reproducing the speedups from the paper.

I hope this post improved your understanding of Flash Attention. Feel free to leave a comment below if you have any questions.

References

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tutorials – Triton documentation

GitHub – alexdremov/kernels: Collection of useful kernels

The post Understanding Flash Attention: Writing the Algorithm from Scratch in Triton appeared first on Towards Data Science.

]]>
LossVal Explained: Efficiently Estimate the Importance of Your Training Data https://towardsdatascience.com/lossval-explained-efficiently-estimate-the-importance-of-your-training-data-cef557434bf8/ Wed, 15 Jan 2025 14:01:59 +0000 https://towardsdatascience.com/lossval-explained-efficiently-estimate-the-importance-of-your-training-data-cef557434bf8/ How to Exploit the Loss Function for Efficient Data Valuation

The post LossVal Explained: Efficiently Estimate the Importance of Your Training Data appeared first on Towards Data Science.

]]>
LossVal Explained: Efficient Data Valuation for Neural Networks

How to exploit the loss function to efficiently estimate the importance of your training data

Data Valuation visualized. Understanding how important each training sample is for the performance of your machine learning model. (Image by author.)

This blog post summarizes and explains our paper "LossVal: Efficient Data Valuation for Neural Networks".


Not all data is created equal: Some training data points influence the training of a Machine Learning model much more than others. Estimating the influence of each data point is usually computationally expensive, as it typically relies on repeated retraining of the model. LossVal presents a new approach that efficiently integrates the Data Valuation process into the loss function of an artificial neural network.

What is Data Valuation?

Machine Learning models are often trained with large datasets. In most cases, not all training samples in such a dataset are equally helpful or informative for the model. For example, if a data point is noisy or has a wrong label, it is less informative for your machine learning model. For one of the tasks in our paper, we trained a machine-learning model on a vehicle crash test dataset to predict how harmful a crash would be for an occupant, based on some vehicle parameters. Some of the data points are from cars of the 80s and 90s! You can imagine that very old cars may be less important for the model’s predictions on modern cars.

The process of understanding the effect of each training sample on the machine-learning model is called Data Valuation, where an importance score is assigned to each training sample. Data Valuation is a growing field connected to data markets, explainable AI, active learning, and many more. Many approaches have been proposed, like Data Shapley, Influence Functions, or LAVA. To learn more about this, you can check out my recent blog post that presents different Data Valuation methods and applications.

LossVal

The basic idea behind LossVal is to "learn" the importance score of each sample while training the model, similar to how the model weights are learned. This saves us from rerunning the training of the model multiple times and from having to track all model weight updates during the training.

To achieve this, we can modify standard loss functions like the mean squared error (MSE) and the cross-entropy loss. We incorporate instance-based weights into the loss and multiply it by a weighted distance function. In general, the LossVal loss functions have the following form:

where ℒ indicates the weighted target loss (weighted MSE or cross-entropy) and OT indicates a weighted distribution distance (OT stands for optimal transport). This results in new loss functions that can be used like any other loss function for training a neural network. During each training step, however, the weights w in the loss are updated via gradient descent.

We demonstrate this for regression tasks using the MSE and for classification using the cross-entropy loss. Afterward, we take a closer look at the distribution distance OT.

LossVal for Regression

Let’s start with the MSE. The standard MSE is the mean of the squared differences between the model predictions ŷ and the correct targets y (with n being the index of the training sample):

MSE = (1/N) · Σ_n (y_n − ŷ_n)²

For LossVal, we modify the MSE in two steps: First, a weight wₙ is included for each training instance n. Second, the whole MSE is multiplied with a weighted distribution distance function.

LossVal for Classification

The cross-entropy loss is typically expressed like this:

CE = − Σ_n Σ_k y_{n,k} · log(ŷ_{n,k})

We can modify the cross-entropy loss in the same way as the MSE:

The Optimal Transport Distance

The optimal transport distance is the minimum effort you need to transform one distribution into another. It is also known as the earth mover’s distance, coming from the analogy of the cheapest way to fill a hole with a pile of dirt. OT can be defined as:

where c is the cost of moving point xₙ to xⱼ. Each γ is a possible transport plan, defining how the points are moved. The optimal transport plan is the γ* with the least effort involved (the smallest distribution distance). Note that we include the weights w in the cost function via the joint distribution Π(w, 1). In other words, OTᵥᵥ is the weighted distance between the training and the validation set. You can find an in-depth explanation of optimal transport here.

In a more practical sense, minimizing OTᵥᵥ by changing the weights will assign higher weights to the training data points similar to the validation data. Effectively, noisy samples get a smaller weight.

Implementation

Our implementation and all data are available on GitHub. The code below shows the implementation of LossVal for the mean squared error.

import torch
from geomloss import SamplesLoss  # assumed import: SamplesLoss (Sinkhorn distance) from the geomloss package

def LossVal_mse(train_X: torch.Tensor,
                train_y_true: torch.Tensor, train_y_pred: torch.Tensor,
                val_X: torch.Tensor, sample_ids: torch.Tensor,
                weights: torch.Tensor, device: torch.device) -> torch.Tensor:
    weights = weights.index_select(0, sample_ids)  # Select the weights corresponding to the sample_ids

    # Step 1: Compute the weighted mse loss
    loss = torch.sum((train_y_true - train_y_pred) ** 2, dim=1)
    weighted_loss = torch.sum(weights @ loss)  # Loss is a vector, weights is a matrix

    # Step 2: Compute the Sinkhorn distance between the training and validation distributions
    sinkhorn_distance = SamplesLoss(loss="sinkhorn")
    dist_loss = sinkhorn_distance(weights, train_X, torch.ones(val_X.shape[0], requires_grad=True).to(device), val_X)

    # Step 3: Combine mse and Sinkhorn distance
    return weighted_loss * dist_loss**2

This loss function works like any other loss function in PyTorch, with some peculiarities: the parameters include the validation set, the sample weights, and the indices of the samples in the batch. This is necessary to select the correct weights for the batched samples when calculating the weighted loss. Keep in mind that this implementation relies on PyTorch’s automatic gradient calculation. That means the sample weights vector needs to be part of the model parameters. This way, the optimization of the weights profits from the optimizer implementation, like Adam. Alternatively, one could also update the weights by hand, using the gradient of the loss with respect to each weight i. The implementation for cross-entropy works equivalently, but you need to replace line 8.
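
For illustration, here is a hedged sketch of what that replacement of line 8 could look like for classification, assuming integer class labels and raw logits as predictions; the original repository may handle this differently:

# Hypothetical classification variant: the per-sample loss (line 8 above) becomes a
# cross-entropy instead of a squared error. Assumes integer class labels and raw logits.
loss = torch.nn.functional.cross_entropy(train_y_pred, train_y_true, reduction="none")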

But does it work?

Benchmark comparison of different Data Valuation methods for noisy sample detection. Higher is better. (Image by author.)

The graphic above shows the comparison between different Data Valuation approaches on the noisy sample detection task. This task is defined by the OpenDataVal benchmark. First, noise is added to p% of the training data; then Data Valuation is used to find the noisy samples. Better methods will find more of the noisy samples, hence achieving a higher F1 score. The graph above shows the average over 6 datasets for classification and 6 datasets for regression. We tested three different noise types: noisy labels, noisy features, and mixed noise. In the mixed noise condition, half of the noisy samples have feature noise and the other half have label noise. In noisy sample detection, LossVal outperforms all other methods for label noise and mixed noise. However, LAVA performs better for feature noise.

The experimental setup for the point removal experiment (graphic below) is similar. However, here the goal is to remove the highest-valued data points from the training set and see how a model trained on this reduced training set performs. This means that a better Data Valuation method will lead to a faster degradation in model performance, because it removes important data points earlier. We found that LossVal matches state-of-the-art methods.

Benchmark comparison of different Data Valuation methods for removal of high-value points. Lower is better. (Image by author.)

For more detailed results, take a look at our paper.

Conclusion

The idea behind LossVal is simple: Use the gradient descent to find an optimal weight for each data point. The weight signifies the importance of the data point.

Our experiments show that LossVal achieves state-of-the-art performance on the OpenDataVal benchmark. LossVal has a lower time complexity than all other model-based approaches we tested and demonstrates a more robust performance over different types of noise and tasks.

Overall, LossVal presents an efficient alternative to other state-of-the-art Data Valuation approaches for neural networks.


Feel free to get in touch via LinkedIn

The post LossVal Explained: Efficiently Estimate the Importance of Your Training Data appeared first on Towards Data Science.

]]>
From Darwin to Deep Work https://towardsdatascience.com/from-darwin-to-deep-work-0db4bf1761d8/ Tue, 14 Jan 2025 13:02:21 +0000 https://towardsdatascience.com/from-darwin-to-deep-work-0db4bf1761d8/ Focus Strategies for Machine Learning Practitioners

The post From Darwin to Deep Work appeared first on Towards Data Science.

]]>
Focus strategies for machine learning practitioners

In December 1831, the HMS Beagle set sail on a nearly five-year journey around the world. Among its crew was 22-year-old Charles Darwin, eager to explore the tropical regions of the globe. During this voyage, Darwin collected fossils and developed a growing interest in the natural world, particularly in plants and species. The trip profoundly shaped Darwin’s thinking and laid the groundwork for his seminal work, On the Origin of Species, which was eventually published 23 years later in 1859.

Photo by Torsten Dederichs on Unsplash

Writing a book of such groundbreaking importance required clarity of thought. Darwin adhered to a strict daily schedule to stay productive. He would rise at 7 a.m. and dedicate the hours from 8 a.m. to noon to uninterrupted study and writing. These morning sessions were reserved for his most cognitively demanding work: analyzing his findings and refining his theories. After lunch, he would mentally work through challenges on long walks. Darwin’s disciplined routine and deep focus enabled him to tackle complex problems over extended periods – a testament to a concept called "Deep Work."

Deep Work

The term deep work was coined by computer science professor Cal Newport in his book of the same name. Newport defines deep work as the ability to focus intensely on a single, cognitively demanding task. For Machine Learning (ML) practitioners, deep work is an especially critical and valuable skill.

Deep Work for ML practitioners

Machine learning research involves tasks that are intellectually demanding. Examples include reading papers, solving mathematical problems, and programming algorithms. Each of these activities is cognitively taxing, requiring full attention over long spans of uninterrupted time.

Take reading research papers, for instance. For beginners, it can take three hours or more to read a single paper end-to-end. Following the authors’ arguments and understanding the underlying concepts is mentally exhausting. If you’re interrupted during this process – whether by an email notification or a colleague dropping by – it disrupts your train of thought. Research has shown that returning to the original point of focus after an interruption takes significant time and effort [1][2].

Reading a paper is only one aspect of machine learning; programming and mathematics equally require continued focus. Writing code to implement an algorithm is a creative and demanding task. Interruptions during coding sessions not only break your concentration but can lead to errors that are frustrating and time-consuming to debug. As any math student can attest, working through complex mathematical proofs or equations demands sustained focus; distractions make it nearly impossible to follow the logical flow.

The Modern Workplace

While being able to work deeply, especially in the field of machine learning, is essential, modern workplaces offer a wide array of distractions. Offices are built on the idea of communication, and thus email and instant messaging systems like Teams are commonly used. Although these tools make collaboration easier, they also encourage constant communication. Even when you are deep into coding a new algorithm, a single quick click on a shiny icon is enough to fragment your concentration.

Statistics suggest that the average worker spends about 50% of their day on email alone [3]. It’s not the case, as Cal Newport describes in Deep Work, that you can instantly switch your attention to answering these emails and then go directly back to coding. Quite the contrary: the residual effect of attention means that part of your mental capacity is still caught in the previous task (answering messages) when you return to your initial task (coding) [2].

The residual effect is amplified when you could not finish the interrupting task (e.g., when a message you saw asks you to prepare a presentation for some event X), causing unfinished fragments to linger in your mind. It’s unlikely that Darwin would have been able to think clearly in such an environment.

Deep Work Strategies in ML

So, how can ML practitioners find the time to do what they are good at in a state of deep focus? At first glance, it is appealing to entirely disconnect from email and messaging apps and always focus on reading papers and coding ML algorithms. However, this is neither practical nor advisable. On the contrary, communication is crucial in any workplace, and many ML projects heavily benefit from exchange with colleagues. Ignoring this reality entirely could lead to professional consequences, such as projects being cancelled.

Thus, instead of eliminating communication, the goal should be to balance it with deep work.

Over the past six years of working in machine learning (see Further Reading at this article’s end), I’ve tried several strategies of integrating deep work into my ML practice. I’ve found that the best time to develop these habits is during your university or college years, when your schedule is more flexible. While lecture times might dictate parts of your day, you are still relatively free to structure your own studies.

Once you transition into the professional world, you’ll likely have less control over your schedule. However, even in these environments, you can still make time for deep work. Practically speaking, if deep work enables you to implement ML algorithms faster or develop better solutions, everyone benefits from your practice.

With that in mind, here are the three basic strategies that I found most practical for deeply working on machine learning. Use these as a starting point for implementing your own deep work practice.

Schedule Deep Work Blocks Pre-scheduling blocks of deep work is probably the fastest approach to making time for deep, uninterrupted work. For example, set aside some hours in the morning for tasks that require maximum focus, such as reading papers, coding, or developing theories. In my experience, three to four hours of uninterrupted concentration is a realistic upper bound, after which I usually need a (lunch) break to recharge. Generally, I silence my notifications during these deep work blocks.

Having two blocks of deep work per day (e.g., one in the morning, one in the evening) is doable, but might require practice, especially if you are not (or no longer) used to concentrating intensively over longer periods of time. For starters, I usually recommend two hours first thing in the morning, and then expanding the block with experience. Bonus point: schedule these blocks the evening before.

Batch Communication Pre-scheduling blocks of deep work allows you to focus intensively, but it will not handle communication for you. Inspired by Newport, I suggest batching communication. Depending on your circumstances, you can check mail four times daily: directly in the morning, before lunch, after lunch, and late in the afternoon. I found that this approach minimizes interruptions while ensuring that you are still available for colleagues.

To repeat, skipping communication altogether is not an option and won’t help you and your colleagues. To some degree, machine learning lives from exchanging ideas and concerns – which is what conferences are made for.

Create a Distraction-free Environment Once you have pre-scheduled times for deep work and handling communication, it’s time to optimize your workspace for focus. In my experience, you have two levers: reducing sounds and reducing visual clutter. As for reducing sounds, I use ear plugs (alternative: noise-canceling headphones) to block noise. For reducing visual distractions (think: conference posters, coworkers walking by, etc.), I try to work from distraction-free booths. If you work from home at least partially, then you can follow Darwin, who retreated to his office to work without distractions.


Charles Darwin’s disciplined routine during the writing of On the Origin of Species offers valuable lessons for ML practitioners. Achieving breakthroughs, whether in natural science or machine learning, requires time to think and work deeply. In this article, I presented three strategies that I found most practical for building stretches of deep work into your machine learning practice: pre-schedule blocks of deep work, batch communication, and reduce environmental distractions. Start with these central strategies to implement your own deep work practice.


Acknowledgements: The introduction of Darwin is inspired by two Wikipedia entries. The story of Darwin’s routine is described in Deep Work by Cal Newport, another inspiration for this article.

Further reading:

How I’d Learn Machine Learning Again, After 6 Years

ML Beginners Should Read Papers

How (and Where) ML Beginners Can Find Papers

Deep Work: Rules for Focused Success in a Distracted World – Cal Newport

Sources:

The post From Darwin to Deep Work appeared first on Towards Data Science.

]]>