Show and Tell
Implementing one of the earliest neural image caption generator models with PyTorch.

Photo by Ståle Grut on Unsplash

Introduction

Natural Language Processing and Computer Vision used to be two completely different fields. At least back when I started learning machine learning and deep learning, I felt like there were multiple paths to follow, and each of them, including NLP and Computer Vision, led me to a completely different world. Over time, AI has become more and more advanced, and the intersection between multiple fields of study, including the two I just mentioned, has become increasingly common.

Today, many language models have the capability to generate images based on a given prompt. That’s one example of a bridge between NLP and Computer Vision. But I’ll save it for an upcoming article as it is a bit more complex. Instead, in this article I am going to discuss a simpler one: image captioning. As the name suggests, this is essentially a technique where a model accepts an image and returns a text that describes the input image.

One of the earliest papers on this topic is the one titled "Show and Tell: A Neural Image Caption Generator", written by Vinyals et al. back in 2015 [1]. In this article, I will focus on implementing the Deep Learning model proposed in the paper using PyTorch. Note that I won’t actually demonstrate the training process here as that’s a topic on its own. Let me know in the comments if you want a separate tutorial on that.


Image Captioning Framework

Generally speaking, image captioning can be done by combining two types of models: the one specialized to process images and another one capable of processing sequences. I believe you already know what kind of models work best for the two tasks – yes, you’re right, those are CNN and RNN, respectively. The idea here is that the CNN is utilized to encode the input image (hence this part is called encoder), whereas the RNN is used for generating a sequence of words based on the features encoded by the CNN (hence the RNN part is called decoder).

The paper describes the authors’ approach as using GoogLeNet (a.k.a. Inception V1) for the encoder and LSTM for the decoder. In fact, the use of GoogLeNet is not explicitly mentioned, yet based on the illustration provided in the paper, the encoder architecture appears to be adopted from the original GoogLeNet paper [2]. The figure below shows what the proposed architecture looks like.

Figure 1. The image captioning model proposed in [1], where the encoder part (the leftmost block) implements the GoogLeNet model [2].

Talking more specifically about the connection between the encoder and the decoder, there are several methods available for connecting the two, namely init-inject, pre-inject, par-inject, and merge, as mentioned in [3]. In the case of the Show and Tell paper, the authors used pre-inject, a method where the features extracted by the encoder are treated as the 0th word in the caption. Later in the inference phase, we expect the decoder to generate a caption based solely on these image features.

Figure 2. The four possible methods for connecting the encoder and the decoder of an image captioning model [3]. In our case we are going to use the pre-inject method (b).

As we already understood the theory behind the image captioning model, we can now jump into the code!


Implementation

I’ll break the implementation part into three sections: the Encoder, the Decoder, and the combination of the two. Before we actually get into them, we need to import the modules and initialize the required parameters in advance. Look at the Codeblock 1 below to see the modules I use.

# Codeblock 1
import torch  #(1)
import torch.nn as nn  #(2)
import torchvision.models as models  #(3)
from torchvision.models import GoogLeNet_Weights  #(4)

Let’s break down these imports quickly: the module imported at line #(1) is used for basic tensor operations, line #(2) is for initializing neural network layers, line #(3) is for loading various deep learning models, and line #(4) is for the pretrained weights of the GoogLeNet model.

Talking about the parameter configuration, EMBED_DIM and LSTM_HIDDEN_DIM are the only two parameters mentioned in the paper, which are both set to 512 as shown at line #(1) and #(2) in the Codeblock 2 below. The EMBED_DIM variable essentially indicates the feature vector size representing a single token in the caption. In this case, we can simply think of a single token as an individual word. Meanwhile, LSTM_HIDDEN_DIM is a variable representing the hidden state size inside the LSTM cell. This paper does not mention how many times this RNN-based layer is repeated, but based on the diagram in Figure 1, it seems like it only implements a single LSTM cell. Thus, at line #(3) I set the NUM_LSTM_LAYERS variable to 1.

# Codeblock 2
EMBED_DIM       = 512    #(1)
LSTM_HIDDEN_DIM = 512    #(2)
NUM_LSTM_LAYERS = 1      #(3)

IMAGE_SIZE      = 224    #(4)
IN_CHANNELS     = 3      #(5)

SEQ_LENGTH      = 30     #(6)
VOCAB_SIZE      = 10000  #(7)

BATCH_SIZE      = 1

The next two parameters are related to the input image, namely IMAGE_SIZE (#(4)) and IN_CHANNELS (#(5)). Since we are about to use GoogLeNet for the encoder, we need to match its original input shape (3×224×224). Not only for the image, but we also need to configure the parameters for the caption. Here we assume that the caption length is no more than 30 words (#(6)) and the number of unique words in the dictionary is 10000 (#(7)). Lastly, the BATCH_SIZE parameter is used because by default PyTorch processes tensors in a batch. Just to make things simple, the number of image-caption pairs within a single batch is set to 1.

GoogLeNet Encoder

It is actually possible to use any kind of CNN-based model for the encoder. I found on the internet that [4] uses DenseNet, [5] uses Inception V3, and [6] utilizes ResNet for similar tasks. However, since my goal is to reproduce the model proposed in the paper as closely as possible, I am using the pretrained GoogLeNet model instead. Before we get into the encoder implementation, let’s see what the GoogLeNet architecture looks like using the following code.

# Codeblock 3
models.googlenet()

The resulting output is very long as it lists literally all layers inside the architecture. Here I truncate the output since I only want you to focus on the last layer (the fc layer marked with #(1) in the Codeblock 3 Output below). You can see that this linear layer maps a feature vector of size 1024 into 1000. Normally, in a standard image classification task, each of these 1000 neurons corresponds to a specific class. So, for example, if you want to perform a 5-class classification task, you would need to modify this layer such that it projects the outputs to 5 neurons only. In our case, we need to make this layer produce a feature vector of length 512 (EMBED_DIM). With this, the input image will later be represented as a 512-dimensional vector after being processed by the GoogLeNet model. This feature vector size will exactly match with the token embedding dimension, allowing it to be treated as a part of our word sequence.

# Codeblock 3 Output
GoogLeNet(
  (conv1): BasicConv2d(
    (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (maxpool1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)
  (conv2): BasicConv2d(
    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )

  .
  .
  .
  .

  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=1000, bias=True)  #(1)
)

Now let’s actually load and modify the GoogLeNet model, which I do in the InceptionEncoder class below.

# Codeblock 4a
class InceptionEncoder(nn.Module):
    def __init__(self, fine_tune):  #(1)
        super().__init__()
        self.googlenet = models.googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)  #(2)
        self.googlenet.fc = nn.Linear(in_features=self.googlenet.fc.in_features,  #(3)
                                      out_features=EMBED_DIM)  #(4)

        if fine_tune == True:       #(5)
            for param in self.googlenet.parameters():
                param.requires_grad = True
        else:
            for param in self.googlenet.parameters():
                param.requires_grad = False

        for param in self.googlenet.fc.parameters():
            param.requires_grad = True

The first thing we do in the above code is to load the model using models.googlenet(). It is mentioned in the paper that the model is already pretrained on the ImageNet dataset. Thus, we need to pass GoogLeNet_Weights.IMAGENET1K_V1 into the weights parameter, as shown at line #(2) in Codeblock 4a. Next, at line #(3) we access the classification head through the fc attribute, where we replace the existing linear layer with a new one having the output dimension of 512 (EMBED_DIM) (#(4)). Since this GoogLeNet model is already trained, we don’t need to train it from scratch. Instead, we can either perform fine-tuning or transfer learning in order to adapt it to the image captioning task.

In case you’re not yet familiar with the two terms, fine-tuning is a method where we update the weights of the entire model. On the other hand, transfer learning is a technique where we only update the weights of the layers we replaced (in this case it’s the last fully-connected layer), while keeping the weights of the existing layers frozen. To do so, I implement a flag named fine_tune at line #(1) which lets the model perform fine-tuning whenever it is set to True (#(5)).

The forward() method is pretty straightforward since what we do here is simply passing the input image through the modified GoogLeNet model. See the Codeblock 4b below for the details. Additionally, here I also print out the tensor dimension before and after processing so that you can better understand how the InceptionEncoder model works.

# Codeblock 4b
    def forward(self, images):
        print(f'original\t: {images.size()}')
        features = self.googlenet(images)
        print(f'after googlenet\t: {features.size()}')

        return features

To test whether our encoder works properly, we can pass a dummy tensor of size 1×3×224×224 through the network as demonstrated in Codeblock 5. This tensor dimension simulates a single RGB image of size 224×224. You can see in the resulting output that our image now becomes a single-dimensional feature vector with the length of 512.

# Codeblock 5
inception_encoder = InceptionEncoder(fine_tune=True)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = inception_encoder(images)
# Codeblock 5 Output
original         : torch.Size([1, 3, 224, 224])
after googlenet  : torch.Size([1, 512])

LSTM Decoder

As we have successfully implemented the encoder, now we are going to create the LSTM decoder, which I demonstrate in Codeblock 6a and 6b. What we need to do first is to initialize the required layers, namely an embedding layer (#(1)), the LSTM layer itself (#(2)), and a standard linear layer (#(3)). The first one (nn.Embedding) is responsible for mapping every single token into a 512 (EMBED_DIM)-dimensional vector. Meanwhile, the LSTM layer processes the sequence of embedded tokens, and each of its outputs will be mapped into a 10000 (VOCAB_SIZE)-dimensional vector by the linear layer. Later on, the values contained in this vector will represent the likelihood of each word in the dictionary being chosen.

# Codeblock 6a
class LSTMDecoder(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)
        #(2)
        self.lstm = nn.LSTM(input_size=EMBED_DIM, 
                            hidden_size=LSTM_HIDDEN_DIM, 
                            num_layers=NUM_LSTM_LAYERS, 
                            batch_first=True)
        #(3)        
        self.linear = nn.Linear(in_features=LSTM_HIDDEN_DIM, 
                                out_features=VOCAB_SIZE)

Next, let’s define the flow of the network using the following code.

# Codeblock 6b
    def forward(self, features, captions):                 #(1)
        print(f'features original\t: {features.size()}')
        features = features.unsqueeze(1)                   #(2)
        print(f"after unsqueeze\t\t: {features.shape}")

        print(f'captions original\t: {captions.size()}')
        captions = self.embedding(captions)                #(3)
        print(f"after embedding\t\t: {captions.shape}")

        captions = torch.cat([features, captions], dim=1)  #(4)
        print(f"after concat\t\t: {captions.shape}")

        captions, _ = self.lstm(captions)                  #(5)
        print(f"after lstm\t\t: {captions.shape}")

        captions = self.linear(captions)                   #(6)
        print(f"after linear\t\t: {captions.shape}")

        return captions

You can see in the above code that the forward() method of the LSTMDecoder class accepts two inputs: features and captions, where the former is the image that has been processed by the InceptionEncoder, while the latter is the caption of the corresponding image serving as the ground truth (#(1)). The idea here is that we are going to perform pre-inject operation by prepending the features tensor into captions using the code at line #(4). However, keep in mind that we need to adjust the shape of both tensors beforehand. To do so, we have to insert a single dimension at the 1st axis of the image features (#(2)). Meanwhile, the shape of the captions tensor will align with our requirement right after being processed by the embedding layer (#(3)). As the features and captions have been concatenated, we then pass this tensor through the LSTM layer (#(5)) before it is eventually processed by the linear layer (#(6)). Look at the testing code below to better understand the flow of the two tensors.

# Codeblock 7
lstm_decoder = LSTMDecoder()

features = torch.randn(BATCH_SIZE, EMBED_DIM)  #(1)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(2)

captions = lstm_decoder(features, captions)

In Codeblock 7, I assume that features is a dummy tensor that represents the output of the InceptionEncoder model (#(1)). Meanwhile, captions is the tensor representing a sequence of tokenized words, where in this case I initialize it as random numbers ranging between 0 to 10000 (VOCAB_SIZE) with the length of 30 (SEQ_LENGTH) (#(2)).

We can see in the output below that the features tensor initially has the dimension of 1×512 (#(1)). This shape changes to 1×1×512 after the unsqueeze() operation (#(2)). The additional dimension in the middle (1) allows the tensor to be treated as a feature vector corresponding to a single timestep, which is necessary for compatibility with the LSTM layer. As for the captions tensor, its shape changes from 1×30 (#(3)) to 1×30×512 (#(4)), indicating that every single word is now represented as a 512-dimensional vector.

# Codeblock 7 Output
features original : torch.Size([1, 512])       #(1)
after unsqueeze   : torch.Size([1, 1, 512])    #(2)
captions original : torch.Size([1, 30])        #(3)
after embedding   : torch.Size([1, 30, 512])   #(4)
after concat      : torch.Size([1, 31, 512])   #(5)
after lstm        : torch.Size([1, 31, 512])   #(6)
after linear      : torch.Size([1, 31, 10000]) #(7)

After the pre-inject operation is performed, our tensor now has the dimension of 1×31×512, where the features tensor becomes the token at the 0th timestep in the sequence (#(5)). See the following figure to better illustrate this idea.

Figure 3. What the resulting tensor looks like after the pre-injection operation. [3].

Next, we pass the tensor through the LSTM layer, where in this particular case the output tensor dimension remains the same. However, it is important to note that the tensor shapes at lines #(5) and #(6) in the above output are actually determined by different parameters. The dimensions appear to match here only because EMBED_DIM and LSTM_HIDDEN_DIM were both set to 512. If we used a different value for LSTM_HIDDEN_DIM, the output dimension would be different as well. Finally, we project each of the 31 token embeddings to a vector of size 10000, which will later contain the probability of every possible token being predicted (#(7)).

GoogLeNet Encoder + LSTM Decoder

At this point, we have successfully created both the encoder and the decoder parts of the image captioning model. What I am going to do next is to combine them together in the ShowAndTell class below.

# Codeblock 8a
class ShowAndTell(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = InceptionEncoder(fine_tune=True)  #(1)
        self.decoder = LSTMDecoder()     #(2)

    def forward(self, images, captions):
        features = self.encoder(images)  #(3)
        print(f"after encodert: {features.shape}")

        captions = self.decoder(features, captions)      #(4)
        print(f"after decodert: {captions.shape}")

        return captions

I think the above code is pretty straightforward. In the __init__() method, we only need to initialize the InceptionEncoder as well as the LSTMDecoder models (#(1) and #(2)). Here I assume that we are about to perform fine-tuning rather than transfer learning, so I set the fine_tune parameter to True. Theoretically speaking, fine-tuning is better than transfer learning if you have a relatively large dataset since it works by re-adjusting the weights of the entire model. However, if your dataset is rather small, you should go with transfer learning instead – but that’s just the theory. It’s definitely a good idea to experiment with both options to see which works best in your case.

Still with the above codeblock, we configure the forward() method to accept image-caption pairs as input. With this configuration, we basically design this method such that it can only be used for training purpose. Here we initially process the raw image with the GoogLeNet inside the encoder block (#(3)). Afterwards, we pass the extracted features as well as the tokenized captions into the decoder block and let it produce another token sequence (#(4)). In the actual training, this caption output will then be compared with the ground truth to compute the error. This error value is going to be used to compute gradients through backpropagation, which determines how the weights in the network are updated.

It is important to know that we cannot use the forward() method to perform inference, so we need a separate one for that. In this case, I am going to implement the code specifically to perform inference in the generate() method below.

# Codeblock 8b
    def generate(self, images):  #(1)
        features = self.encoder(images)              #(2)
        print(f"after encodertt: {features.shape}n")

        words = []  #(3)
        for i in range(SEQ_LENGTH):                  #(4)
            print(f"iteration #{i}")
            features = features.unsqueeze(1)
            print(f"after unsqueezett: {features.shape}")

            features, _ = self.decoder.lstm(features)
            print(f"after lstmtt: {features.shape}")

            features = features.squeeze(1)           #(5)
            print(f"after squeezett: {features.shape}")

            probs = self.decoder.linear(features)    #(6)
            print(f"after lineartt: {probs.shape}")

            _, word = probs.max(dim=1)  #(7)
            print(f"after maxtt: {word.shape}")

            words.append(word.item())  #(8)

            if word == 1:  #(9)
                break

            features = self.decoder.embedding(word)  #(10)
            print(f"after embeddingtt: {features.shape}n")

        return words       #(11)

Instead of taking two inputs like the previous one, the generate() method takes raw image as the only input (#(1)). Since we want the features extracted from the image to be the initial input token, we first need to process the raw input image with the encoder block prior to actually generating the subsequent tokens (#(2)). Next, we allocate an empty list for storing the token sequence to be produced later (#(3)). The tokens themselves are generated one by one, so we wrap the entire process inside a for loop, which is going to stop iterating once it reaches at most 30 (SEQ_LENGTH) words (#(4)).

The steps done inside the loop are algorithmically similar to the ones we discussed earlier. However, since the LSTM cell here generates a single token at a time, the process requires the tensor to be treated a bit differently from the one passed through the forward() method of the LSTMDecoder class back in Codeblock 6b. The first difference you might notice is the squeeze() operation (#(5)), which is basically just a technical step that lets the subsequent layer perform the linear projection correctly (#(6)). Then, we take the index of the feature vector having the highest value, which corresponds to the token most likely to come next (#(7)), and append it to the list we allocated earlier (#(8)). The loop breaks whenever the predicted index corresponds to the stop token, which in this case I assume to be at index 1 of the probs vector (#(9)). Otherwise, if the model does not find the stop token, it converts the last predicted word into its 512 (EMBED_DIM)-dimensional vector (#(10)), allowing it to be used as the input features for the next iteration. Lastly, the generated word sequence will be returned once the loop is completed (#(11)).

We are going to simulate the forward pass for the training phase using the Codeblock 9 below. Here I pass two tensors through the show_and_tell model (#(1)), each representing a raw image of size 3×224×224 (#(2)) and a sequence of tokenized words (#(3)). Based on the resulting output, we found that our model works properly as the two input tensors successfully passed through the InceptionEncoder and the LSTMDecoder part of the network.

# Codeblock 9
show_and_tell = ShowAndTell()  #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))      #(3)

captions = show_and_tell(images, captions)
# Codeblock 9 Output
after encoder : torch.Size([1, 512])
after decoder : torch.Size([1, 31, 10000])
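To give a rough idea of how this forward pass would be used during training, below is a minimal sketch of a single training step. This is my own assumption rather than something prescribed by the paper or shown in this article: I use cross-entropy loss with Adam, and I align the first 30 output timesteps with the 30 ground-truth tokens by dropping the last prediction.

# A hedged sketch of one training step (assumed setup, not from the paper).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(show_and_tell.parameters(), lr=1e-4)

show_and_tell.train()
optimizer.zero_grad()

images   = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))

outputs = show_and_tell(images, captions)       # shape: (1, 31, 10000)
predictions = outputs[:, :SEQ_LENGTH, :]        # drop the last timestep -> (1, 30, 10000)

loss = criterion(predictions.reshape(-1, VOCAB_SIZE),  # (30, 10000)
                 captions.reshape(-1))                 # (30,)
loss.backward()
optimizer.step()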

Now, let’s assume that our show_and_tell model is already trained on an image captioning dataset, and thus ready to be used for inference. Look at the Codeblock 10 below to see how I do it. Here we set the model to eval() mode (#(1)), initialize the input image (#(2)), and pass it through the model using the generate() method (#(3)).

# Codeblock 10
show_and_tell.eval()  #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)

with torch.no_grad():
    generated_tokens = show_and_tell.generate(images)  #(3)

The flow of the tensor can be seen in the output below. Here I truncate the resulting output because it would only repeat the same token generation process up to 30 times.

# Codeblock 10 Output
after encoder    : torch.Size([1, 512])

iteration #0
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

iteration #1
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

.
.
.
.

To see what the resulting caption looks like, we can just print out the generated_tokens list as shown below. Keep in mind that this sequence is still in the form of tokenized words. Later, in the post-processing stage, we will need to convert them back to the words corresponding to these numbers.

# Codeblock 11
generated_tokens
# Codeblock 11 Output
[5627,
 3906,
 2370,
 2299,
 4952,
 9933,
 402,
 7775,
 602,
 4414,
 8667,
 6774,
 9345,
 8750,
 3680,
 4458,
 1677,
 5998,
 8572,
 9556,
 7347,
 6780,
 9672,
 2596,
 9218,
 1880,
 4396,
 6168,
 7999,
 454]
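As a rough illustration of that post-processing step, the sketch below maps the token IDs back to words. The idx2word dictionary here is a hypothetical vocabulary lookup (in a real project it would come from the tokenizer built during preprocessing), and index 1 is assumed to be the stop token, consistent with the generate() method above.

# Hypothetical index-to-word mapping built during preprocessing.
idx2word = {idx: f"word_{idx}" for idx in range(VOCAB_SIZE)}
idx2word[1] = "<end>"  # assumed stop token, matching generate()

def tokens_to_caption(token_ids, idx2word):
    words = [idx2word[t] for t in token_ids if t != 1]  # drop the stop token
    return " ".join(words)

print(tokens_to_caption(generated_tokens, idx2word))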

Ending

With the above output, we’ve reached the end of our discussion on image captioning. Over time, many other researchers attempted to make improvements to accomplish this task. So, I think in the upcoming article I will discuss the state-of-the-art method on this topic.

Thanks for reading, I hope you learn something new today!

_By the way you can also find the code used in this article here._


References

[1] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. arXiv. https://arxiv.org/pdf/1411.4555 [Accessed November 13, 2024].

[2] Christian Szegedy et al. Going Deeper with Convolutions. arXiv. https://arxiv.org/pdf/1409.4842 [Accessed November 13, 2024].

[3] Marc Tanti et al. Where to put the Image in an Image Caption Generator. arXiv. https://arxiv.org/pdf/1703.09137 [Accessed November 13, 2024].

[4] Stepan Ulyanin. Captioning Images with CNN and RNN, using PyTorch. Medium. https://medium.com/@stepanulyanin/captioning-images-with-pytorch-bc592e5fd1a3 [Accessed November 16, 2024].

[5] Saketh Kotamraju. How to Build an Image-Captioning Model in Pytorch. Towards Data Science. https://towardsdatascience.com/how-to-build-an-image-captioning-model-in-pytorch-29b9d8fe2f8c [Accessed November 16, 2024].

[6] Code with Aarohi. Image Captioning using CNN and RNN | Image Captioning using Deep Learning. YouTube. https://www.youtube.com/watch?v=htNmFL2BG34 [Accessed November 16, 2024].

NLP Illustrated, Part 3: Word2Vec
An exhaustive and illustrated guide to Word2Vec with code!

Welcome to Part 3 of our illustrated journey through the exciting world of Natural Language Processing! If you caught Part 2, you’ll remember that we chatted about word embeddings and why they’re so cool.

NLP Illustrated, Part 2: Word Embeddings

Word embeddings allow us to create maps of words that capture their nuances and intricate relationships.

This article will break down the math behind building word embeddings using a technique called Word2Vec – a Machine Learning model specifically designed to generate meaningful word embeddings.

Word2Vec offers two methods – Skip-gram and CBOW – but we’ll focus on how the Skip-gram method works, as it’s the most widely used.

These words and concepts might sound complex right now but don’t worry – at its core, it’s just some intuitive math (and a sprinkle of machine learning magic).

Real quick – before diving into this article, I strongly encourage you to read my series on the basics of machine learning. A couple of concepts (like gradient descent and loss functions) build on those fundamentals, and understanding them will make this article much easier to follow.

Machine Learning Starter Pack

That said, don’t worry if you’re unfamiliar with those concepts – this article will cover them at a high level to ensure you can still follow along!


Since Word2Vec is a machine-learning model, like any ML model, it needs two things:

  • Training data: text data to learn from
  • A problem statement: the question the model is trying to answer

Training data

We’re trying to create a map of words, so our training data is going to be text. Let’s start with this sentence:

This will be our toy training data. Of course, in the real world, Word2Vec is trained on massive corpora of text – think entire books, Wikipedia, or large collections of websites. For now though, we’re keeping it simple with just this one sentence, so the model will only learn embeddings for these 18 words.

A problem statement

For Word2Vec, the core problem is simple: Given two words, determine whether they are neighbors

To define "neighbors," we use something called a context window, which specifies how many neighboring words on either side to consider.

For instance, if we want to find the neighbors of the word "happiness"…

…and set the context window size to 2, the neighbors of "happiness" will be "can" and "be".

And here, if we input "happiness" and "can" into the model, ideally we want it to predict that they are neighbors.

Similarly, for the word "darkness," with a context window of 2, the neighbors would be "in" and "the" (before), and "of" and "times" (after).

If we set our context window to 3, the neighbors for "happiness" will be three words on either side.

Terminology segway: Here "happiness" is referred to as the target word, while the neighboring words are known as the context words.

By default, the context window size in Word2Vec is set to 5. However, for simplicity in our example, we’ll use a context window size of 2.
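To make the context-window idea a bit more concrete, here is a small sketch that extracts (target, context) pairs from a tokenized sentence with a window of 2. The helper function and the stand-in sentence are purely illustrative; they are not part of any Word2Vec library.

# Illustrative sketch: build (target, context) pairs with a sliding context window.
def context_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # look at up to `window` words on each side of the target
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "happiness can be found even in the darkest of times".split()
print(context_pairs(sentence, window=2)[:4])
# [('happiness', 'can'), ('happiness', 'be'), ('can', 'happiness'), ('can', 'be')]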

Now, we need to convert this sentence into a neat little table, just like we do for other machine learning problems, with clearly defined inputs and output values.

We can construct this dataset by pairing the target word with each of its context words as inputs…

…and the output will be a label indicating whether the target and context words are neighbors:

1 indicates that they are neighbors

But there’s a glaring issue with this. All our training pairs are positive examples (neighbors), which doesn’t teach the model what non-neighbors look like.

Enter Negative Sampling.

Negative Sampling introduces pairs of words that are not neighbors. So for instance, we know that "happiness" and "light" are not neighbors, so we add that data to our training data with the label 0 to indicate that they are not neighbors.

By adding negative samples, the final dataset contains a mix of positive and negative pairs so that the model can learn to predict whether a given pair is a true neighbor or not.

Typically, we use 2 to 5 negative samples per positive pair for large datasets and up to 10 for smaller ones.

We’ll use 2 negative pairs per positive pair. Our training dataset now looks like this:
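A minimal sketch of how these negative pairs could be drawn is shown below. It simply picks random words from the vocabulary that are outside the target word's true context; real Word2Vec implementations sample from a frequency-smoothed distribution, which we skip here for simplicity.

import random

# Illustrative sketch: draw k negative (non-neighbor) context words per positive pair.
def negative_samples(target, true_context, vocabulary, k=2, seed=42):
    random.seed(seed)
    candidates = [w for w in vocabulary if w != target and w not in true_context]
    return [(target, w, 0) for w in random.sample(candidates, k)]  # label 0 = not neighbors

vocabulary = "happiness can be found even in the darkest of times light".split()
print(negative_samples("happiness", true_context={"can", "be"}, vocabulary=vocabulary, k=2))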

Now comes the fun part – the machine learning magic. Here’s the problem we’re solving: Given a target word and a context word, predict the probability that they are neighbors.

Let’s break it down step by step.

Step 0: Decide embedding dimensions

The first thing we do is to decide the size of the word embeddings. As we’ve learned, larger embeddings capture more nuances and richer relationships but come at the cost of increased computational expense.

The default embedding size in Word2Vec is 100 dimensions, but to keep the explanation simple, let’s use just 2 dimensions.

This means each word will be represented as a point on a 2D graph like so:

Step 1: Initialize embedding matrices

Next, we initialize two distinct sets of embeddings – target embeddings and context embeddings.

And, at the start of training, these embeddings are randomly initialized with values:

The target embeddings and context embeddings are randomly initialized with different values because they serve distinct purposes.

  • Target Embeddings: Represent each word when it’s the target word in training
  • Context Embeddings: Represent each word when it’s a context (neighboring) word

Step 2: Calculate the similarity of target word and context word

In the training process, we work with blocks of one positive pair and their corresponding negative samples.

So in the first pass, we only focus on the first positive pair and its corresponding 2 negative samples.

Now we can determine how similar two words are by calculating the dot product of their embeddings: the target embedding (if it's a target word) and the context embedding (if it's a context word).

  • A larger dot product indicates the words are more "similar" (likely neighbors)
  • A smaller dot product suggests they are more dissimilar (less likely to be neighbors)

And remember, in the first pass, we only calculate the similarity of the 3 pairs in the first block.

Let’s start by taking the dot product of the target word embedding of "happiness" with the context word embedding of "can":

We get:

Now we need to find a way to convert these scores to probabilities because we want to know how likely it is that these two words are neighbors. We can do that by passing this dot product through a sigmoid function.

As a quick refresher, the sigmoid function squishes any input value into a range between 0 and 1, making it perfect for interpreting probabilities. If the dot product is large (indicating high similarity), the sigmoid output will be close to 1 and if the dot product is small (indicating low similarity), the sigmoid output will be closer to 0.

So passing the dot product, -0.36, through the sigmoid function, we get:
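As a quick numerical check, here is a minimal sketch of the dot-product-plus-sigmoid step. The 2D embedding values are made up purely so that their dot product reproduces the -0.36 from the text; they are not the actual values from the figures.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# made-up 2D embeddings for illustration
target_happiness = np.array([0.3, -0.6])
context_can      = np.array([0.2, 0.7])

dot = np.dot(target_happiness, context_can)  # 0.3*0.2 + (-0.6)*0.7 = -0.36
print(sigmoid(dot))                          # ~0.41: predicted probability of being neighbors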

Similarly, we can calculate the dot product and corresponding probabilities for the other two pairs…

…to get the predicted probability that "happiness" and "light" are neighbors…

…and the predicted probability that "happiness" and "even" are neighbors:

This is how we calculate the model’s predicted probabilities of these 3 pairs being neighbors.

As we can see, the predicted values are pretty random and inaccurate, which makes sense because the embeddings were initialized with random values.

Next, we move on to the key step: updating these embeddings to improve the predictions.

Step 3: Calculate error

NOTE: If you haven’t read the article on Logistic Regression, it might be helpful to do so, as the process of calculating error there is very similar. But don’t worry, we’ll also go over the basics here.

Now that we have our predictions, we need to calculate the "error" value to measure how far off the model’s predictions are from the true labels. For this, we use the Log Loss function.

For every prediction, the error is calculated as:

And the overall Log Loss for all predictions in the block is the average of the individual prediction errors:

For our example, if we calculate the loss for the 3 pairs above, it will look like this:

Evaluating this…

…we get 0.3. Our goal is to reduce this loss to 0 or as close to 0 as possible. A loss of 0 means that the model’s predictions perfectly match the true labels.
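For readers who want to see the arithmetic, here is a small sketch of the Log Loss (binary cross-entropy) computation over a block of three pairs. The predicted probabilities are illustrative placeholders, not the exact values behind the 0.3 above.

import numpy as np

def log_loss(y_true, y_pred):
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)
    # average of -[y*log(p) + (1-y)*log(1-p)] over all pairs in the block
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

labels = [1, 0, 0]              # one positive pair + two negative samples
preds  = [0.41, 0.30, 0.25]     # illustrative predicted probabilities
print(log_loss(labels, preds))  # roughly 0.51 for these placeholder values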

Step 4: Update embeddings using gradient descent

Again, we won't dive into the details here since we covered this in our previous article on Logistic Regression. However, we know that the best way to minimize the loss function is by using gradient descent.

To put it simply, Log Loss is a convex function…

…and gradient descent helps us find the lowest point on this curve – the point where the loss is minimized.

It does so by:

  • calculating the gradient (the slope) of the loss function with respect to the embeddings and
  • adjusting the embeddings slightly in the opposite direction of the gradient to reduce the loss (a minimal sketch of one such update is shown right after this list)
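Below is a rough sketch of what one such update could look like for a single (target, context) pair, assuming the standard skip-gram-with-negative-sampling gradient, where the error term is simply the predicted probability minus the true label. The learning rate and embedding values are illustrative.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sgd_update(target_vec, context_vec, label, lr=0.05):
    # error = predicted probability - true label (gradient of the log loss w.r.t. the score)
    error = sigmoid(np.dot(target_vec, context_vec)) - label
    # move each embedding a small step against its gradient
    new_target  = target_vec  - lr * error * context_vec
    new_context = context_vec - lr * error * target_vec
    return new_target, new_context

target_happiness = np.array([0.3, -0.6])  # illustrative values
context_can      = np.array([0.2, 0.7])
print(sgd_update(target_happiness, context_can, label=1))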

So once gradient descent works its magic, we get new embeddings like so:

Let's visualize this change. We start with the target embedding ("happiness") and the context embeddings ("can", "light", and "even") in our block.

And after gradient descent, they shift slightly like so:

This is the REAL magic of this step. We see that automatically:

  • for the positive pair, the target embedding of "happiness" is nudged closer to the context embedding of "can," its neighbor
  • and for the negative pairs, the target embedding ("happiness") is adjusted to move further away from the non-neighboring context embeddings of "light" and "even"

Step 5: Repeat steps 2–4

Now all we have to do is rinse and repeat steps 2–4 using the next block of positive and negative pairs.

Let’s see what this looks like for the second block.

For these values, we determine the model’s predictions of whether the words are neighbors or not by:

(1) Taking dot products and passing them through the sigmoid function…

(2) And then using the Log Loss and gradient descent we update the target and context embedding values for the words in this block:

Again, doing so nudges the neighboring word embeddings closer together, while dissimilar ones are pushed farther apart.

That’s pretty much it. We just repeat these steps with each block in our training data.

Sidenote: Going through all blocks in the training dataset once is called an epoch. We usually repeat this for 5–20 epochs for a super robust training process.

By the end of our full training process, we'll end up with final target and context embeddings that look something like this:

If we get rid of the context embeddings, we are left with just the final target embeddings.

And these final target embeddings are the word embeddings that we were after at the beginning!!

SIDENOTE: If needed, the context embeddings could be averaged or combined with the target embeddings to create a hybrid representation. However, this is rare and not standard practice.

This happens because the training process refines embeddings based on word relationships. Similar words (neighbors) are pulled closer together, while dissimilar words (non-neighbors) are pushed apart. While doing so, it also ends up capturing deeper relationships between words, including synonyms, analogies, and subtle contextual similarities.

Here, our training data was just a single sentence with 18 words, so the embeddings may not seem meaningful. But imagine training on a massive corpus – an entire book, a collection of articles, or billions of sentences from the web.

And that’s it! That’s how we create word embeddings using Word2Vec, specifically the skip-gram method.

Word2Vec IRL

Now that we’ve unpacked the mathematical magic behind Word2Vec, let’s bring it to life and create our own word embeddings.

Use pre-trained word embeddings

The easiest and most efficient way to get started is to use pre-trained word embeddings. These embeddings are already trained on massive datasets like Google News and Wikipedia, so they’re incredibly robust. This means we don’t have to start from scratch, saving both time and computational resources.

We can leverage pre-trained Word2Vec embeddings using Gensim, a popular Python library for NLP that's optimized for handling large-scale text processing tasks.

# install gensim 
# !pip install --upgrade gensim

import gensim.downloader as api

Let’s look at all available pre-trained Word2Vec models in Gensim:

available_models = api.info()['models']

print("Available pre-trained Word2Vec models in Gensim:n")
for model_name, details in available_models.items():
    if 'word2vec' in model_name.lower():  # find models with 'word2vec' in their name
        print(f"Model: {model_name}")
        print(f"  - Description: {details.get('description')}")
Available pre-trained Word2Vec models in Gensim:

Model: word2vec-ruscorpora-300
  - Description: Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words.
Model: word2vec-google-news-300
  - Description: Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/).
Model: __testing_word2vec-matrix-synopsis
  - Description: [THIS IS ONLY FOR TESTING] Word vecrors of the movie matrix.

We see that there are two usable pre-trained models (since one of the models is labeled as a test). Let's put the word2vec-google-news-300 model to the test!

Here’s how to find synonyms of the word "beautiful":

w2v_google_news.most_similar("king")
[('gorgeous', 0.8353005051612854),
 ('lovely', 0.8106936812400818),
 ('stunningly_beautiful', 0.7329413294792175),
 ('breathtakingly_beautiful', 0.7231340408325195),
 ('wonderful', 0.6854086518287659),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402888298035)]

These all make sense.

If you recall from the previous article, we saw how we can perform mathematical operations on word embeddings to get intuitive results. One of the most popular examples of this is…

…which we can test like so:

# king + woman - man
w2v_google_news.most_similar_cosmul(positive=['king', 'woman'], negative=['man'])

The results are impressively accurate!

Let’s try another combination:

# better + bad - good
w2v_google_news.most_similar_cosmul(positive=['better', 'bad'], negative=['good'])
[('worse', 0.9141383767127991),
 ('uglier', 0.8268526792526245),
 ('sooner', 0.7980951070785522),
 ('dumber', 0.7923389077186584),
 ('harsher', 0.791556715965271),
 ('stupider', 0.7884790301322937),
 ('scarier', 0.7865160703659058),
 ('angrier', 0.7857241034507751),
 ('differently', 0.7801468372344971),
 ('sorrier', 0.7758733034133911)]

And "worse" is the top match! Very cool.

As we can see, these pre-trained models are incredibly robust and can be leveraged for most use cases. However, they’re not perfect for every situation. For instance, if we’re working with niche domains like legal or medical texts, general-purpose embeddings may fail to capture the specific meanings and nuances of the language.

Say we have this legal text:

"The appellant seeks declaratory relief under Rule 57, asserting that the respondent’s fiduciary duty was breached by non-disclosure of material facts in accordance with Section 10(b) of the Securities Exchange Act of 1934."

Legal documents are often written in a formal, highly structured style, with terms like "Rule 57" or "Section 10(b)" referencing specific laws and statutes. Words like "material facts" have a precise legal meaning – facts that can influence the outcome of a case – which is very different from how "material" is understood in everyday language.

Pre-trained embeddings trained on general corpora, such as Google News, won’t capture these nuanced, domain-specific meanings. Instead, for tasks like this, we need embeddings trained on domain-specific corpora, such as legal judgments, statutes, or contracts.

Code our own Word2Vec from scratch

This is where building our own Word2Vec model is helpful. By training on a legal corpus, we can create embeddings tailored to our use case, capturing the relationships and meanings specific to the legal domain.
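As a minimal starting point, the sketch below trains a skip-gram Word2Vec model on a custom corpus with Gensim. The tiny legal_corpus is a stand-in for a real collection of legal documents, and the hyperparameters are illustrative defaults rather than tuned values.

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# stand-in corpus; in practice this would be thousands of legal documents
legal_corpus = [
    "The appellant seeks declaratory relief under Rule 57.",
    "The respondent's fiduciary duty was breached by non-disclosure of material facts.",
    "Section 10(b) of the Securities Exchange Act of 1934 is cited in the complaint.",
]

# simple tokenization; a real pipeline would use a proper legal-text tokenizer
tokenized_docs = [simple_preprocess(doc) for doc in legal_corpus]

w2v_legal = Word2Vec(
    sentences=tokenized_docs,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # keep every word (the corpus here is tiny)
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples
    epochs=20,
)

# inspect the learned embedding for a domain-specific term
print(w2v_legal.wv["fiduciary"][:5])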


And just like that we’re done! You now know everything you need to know about Word2Vec.

As always, feel free to connect with me on LinkedIn or email me at shreya.Statistics@gmail.com!

Unless specified, all images are by the author.

Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code
A comparison of two cutting-edge dynamic topic models solving a consumer complaints classification exercise


Source: Freepic, Image by rawpixel.com

Customer reviews about products and services provide valuable information about customer satisfaction. They provide insight into what should be improved across the whole product development cycle. Dynamic topic models in business intelligence can identify key product qualities and other satisfaction factors, cluster them into categories, and evaluate how business decisions materialize in customer satisfaction over time. This is highly valuable information, and not only for product managers.

This article will compare two of the latest topic models on the task of classifying customer complaints data. BERTopic by Maarten Grootendorst (2022) and the recent FASTopic by Xiaobao Wu et al. (2024), presented at last year's NeurIPS, are the current leading models for topic analytics of customer data. For these models, we'll explore in Python code:

  • how to effectively preprocess data
  • how to train a Bigram topic model for customer complaint analysis
  • how to model topic activity over time.

1. Customer complaints data in companies

Complaints data are generated by interaction with customers and typically recorded in ERP systems. There are many channels where customers can raise a concern about a product or service. Here are just a few examples:

  • Email: email communication is stored for the BI team, e.g., in the SQL database.
  • After-purchase survey: feedback sent to customers after product purchase. Companies either send the emails themselves or use a price comparison website (e.g., Billiger in Germany) where customers order the product.
  • Phone transcriptions: after prior consent from a customer, some companies record the phone communication with customers, which is then available for the BI team.
  • Google reviews: customers leave comments and reviews on products and services worldwide. Google enables authorized users to export this data, which can be used for text mining and other purposes.
  • Review platforms: independent review platforms (such as Trustpilot) offer customers a place to provide feedback to brands and companies. This data is available through various APIs.
  • Social media conversations: Instagram, X, and Facebook are full of product or brand-related comments. The simplest way is to use an official API to collect the data. For Instagram and Facebook, go to the developers’ portal to receive an API key. X works the same way.

2. Example data

As example data, we'll use the Amazon Dog Food Reviews dataset from Hugging Face, released under the Apache-2.0 license. The subset for topic modeling contains only 3693 customer reviews collected between 02/01/2016 and 31/12/2020. Here is what the data looks like:

Image 1. Amazon dog food reviews dataset
Image 2. General preprocessing steps for (customer feedback) topic modeling. Image by author

3. Data preprocessing

Processing data systematically in the right order keeps the essential information and does not add a new bias. Let's walk through these steps (a minimal code sketch follows the list):

  • #1: Numbers: digits are typically the characters to remove in the first step.
  • #2: Emoticons: product reviews are typically full of them. For topic modeling in customer reviews, emojis don’t have much significance.
  • #3: Stopwords: apart from standard stopwords, it is common to remove an extended stopwords list for one or more languages.
  • #4: Punctuation: general language has a myriad of special characters and punctuation, which should be cleaned in this step.
  • #5: Additional stopwords: depending on the use case, some additional words are also useful to remove. With the Amazon dog food reviews, these are "dog", "food", "blue", "buffalo", "ha", "month", and "ago".

"Delivery" and "deliveries", "box" and "Boxes", or "Price" and "prices" share the same word root, but without lemmatization, topic models would model them as separate factors. That’s why product reviews should always be lemmatized in the last step of preprocessing.

  • #6: Lemmatization: groups words into a single form (the lemma), keeping the word root information and semantics.
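The sketch below strings these steps together in plain Python and NLTK. It is a simplified stand-in rather than the exact pipeline used for the results in this article; for instance, emojis are handled here by the same non-alphabetic filter as punctuation.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

additional_stopwords = {'dog', 'food', 'blue', 'buffalo', 'ha', 'month', 'ago'}
all_stopwords = set(stopwords.words('english')) | additional_stopwords
lemmatizer = WordNetLemmatizer()

def clean_review(text):
    text = text.lower()
    text = re.sub(r'\d+', ' ', text)       # step 1: remove numbers
    text = re.sub(r'[^a-z\s]', ' ', text)  # steps 2 and 4: emojis, punctuation, special characters
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if t not in all_stopwords]    # steps 3 and 5: stopwords
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # step 6: lemmatization

print(clean_review("My 2 dogs LOVED this food!! Deliveries arrived in damaged boxes."))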

Text preprocessing is model-specific:

  • FASTopic works with clean data on input; some cleaning (stopwords) can be done during the training. The simplest and most effective way is to use the Washer, a no-code app for cleaning text data for text mining projects.
  • BERTopic: the documentation recommends that "removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings". The model embeds the raw text rather than cleaned text stripped of stopwords or reduced to lemmas or tokens. For this reason, cleaning operations should be included in the model training.
Source: Freepic, Image by macrovector

4. Topic modeling with top-notch models

Let’s now check how the satisfaction factors are distributed across the topics. The questions we ask here are:

  • What were the key problems and qualities customers reported on the product?
  • How has product satisfaction changed over time?

The BERTopic and FASTopic papers describe the model architectures in detail. Also, my TDS tutorial on topic modeling explains topic classification with BERTopic on a political speech dataset.

4.1. FASTopic

Import the libraries and the data (complete code and the requirements are here). Then, create a list of clean reviews:

import pandas as pd
from fastopic import FASTopic
from sklearn.feature_extraction.text import CountVectorizer 
from topmost.preprocessing import Preprocessing             

# create a list of reviews
docs = data['clean_text'].tolist()

In FASTopic, bigram generation is not directly implemented. To solve this, we will make a bigram preprocessing class. The model works with bigrams as with individual tokens, so we join the words in bigrams with underscores.

# custom preprocessing class with bigram generation
class NgramPreprocessing:
    def __init__(self, ngram_range=(1, 1), 
                       vocab_size=10000, 
                       stopwords='English'): 

        self.ngram_range = ngram_range
        self.preprocessing = Preprocessing(vocab_size=vocab_size, 
                                           stopwords=stopwords)

        # use a custom analyzer to join bigrams with "_"
        self.vectorizer = CountVectorizer(ngram_range=self.ngram_range, 
                                          max_features=vocab_size, 
                                          analyzer=self._custom_analyzer)

    # custom analyzer function to join bigrams with underscores
    def _custom_analyzer(self, doc):
        # tokenize the document and create bigrams
        tokens = CountVectorizer(ngram_range=self.ngram_range).build_analyzer()(doc)

        # replace spaces in bigrams with "_"
        return [token.replace(" ", "_") for token in tokens]

    def preprocess(self, 
                   docs, 
                   pretrained_WE=False):

        parsed_docs = self.preprocessing.preprocess(docs, 
                      pretrained_WE=pretrained_WE)["train_texts"]
        train_bow = self.vectorizer.fit_transform(parsed_docs).toarray()
        rst = {
            "train_bow": train_bow,
            "train_texts": parsed_docs,
            "vocab": self.vectorizer.get_feature_names_out()
        }
        return rst

# initialize preprocessing with bigrams
ngram_preprocessing = NgramPreprocessing(ngram_range=(2, 2))

Let's train the model with eight topics and display the top 20 bigrams for each topic in a data frame. We train on the underscore-joined bigrams and then strip the underscores when displaying them.

# model training
model = FASTopic(8, ngram_preprocessing, num_top_words=10000)

# fit model to documents
topic_top_words, doc_topic_dist = model.fit_transform(docs)

# retrieve 20 bigrams for each topic
import pandas as pd

max_bigrams = 20

# Retrieve the bigrams for each topic and select only the word columns
topic_0 = pd.DataFrame(model.get_topic(0, max_bigrams), columns=["Topic_0_word", "Topic_0_prob"])[["Topic_0_word"]]
topic_1 = pd.DataFrame(model.get_topic(1, max_bigrams), columns=["Topic_1_word", "Topic_1_prob"])[["Topic_1_word"]]
topic_2 = pd.DataFrame(model.get_topic(2, max_bigrams), columns=["Topic_2_word", "Topic_2_prob"])[["Topic_2_word"]]
topic_3 = pd.DataFrame(model.get_topic(3, max_bigrams), columns=["Topic_3_word", "Topic_3_prob"])[["Topic_3_word"]]
topic_4 = pd.DataFrame(model.get_topic(4, max_bigrams), columns=["Topic_4_word", "Topic_4_prob"])[["Topic_4_word"]]
topic_5 = pd.DataFrame(model.get_topic(5, max_bigrams), columns=["Topic_5_word", "Topic_5_prob"])[["Topic_5_word"]]
topic_6 = pd.DataFrame(model.get_topic(6, max_bigrams), columns=["Topic_6_word", "Topic_6_prob"])[["Topic_6_word"]]
topic_7 = pd.DataFrame(model.get_topic(7, max_bigrams), columns=["Topic_7_word", "Topic_7_prob"])[["Topic_7_word"]]

# concatenate the dataframes
topics_df = pd.concat([topic_0,topic_1, topic_2, topic_3, topic_4,topic_5,topic_6,topic_7], axis=1)

# remove underscores from bigrams
topics_df = topics_df.applymap(lambda x: x.replace('_', ' ') if isinstance(x, str) else x)

We’ve modeled the customer satisfaction factors with a dog food product in eight distinct topics. Here are the manually annotated topic names:

Image 3: Satisfaction factors modeling with FASTopic. Image by author

FASTopic returns relatively distinct topics, sorting the comments of the customers:

  • 0: Negative health effects, "sensitive stomach", "small bite", "stomach issue", "lose weight", "refuse eat", "taste wild", "digestive issue", "upset stomach", "stop eat", "gain weight"
  • 1: Food quality, "love flavor", "quality ingredient", "good ingredient", "healthy ingredient", "ingredient quality", "flavor good", "taste great", "healthy love", "great healthy", "good healthy", "good health", …
  • 2: Positive health effects, "healthy fur", "awesome pup", "eye bright"
  • 3: Digestion effects, "smell bad", "runny poop", "horrible gas", "diarrhea vet", "terrible diarrhea," "sick week", "sick buy", "day vomit"
  • 4: Pricing, "great price", "good price", "love price", "price great", "love cheap", "price deliver", "great deal", "price increase", "free shipping", …
  • 5: Other, other factors.
  • 6: Fur effects, "coat shiny", "fur baby", "skin issue", "shiny coat", "love coat", "coat soft"
  • 7: Delivery, "open box", "bag rip", "big bag", "hole bag", "open bag", "inside box", "bag open", "bag hole", "heavy bag", "rip open", …

It is also useful to check the weight of these categories in the data. The full code is here.
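As a quick sketch of that check (not the article's exact code), the document-topic distribution returned by fit_transform earlier can be used to estimate each topic's share of the reviews, assuming doc_topic_dist is a NumPy array with one row per review and one column per topic:

import numpy as np

topic_share = doc_topic_dist.mean(axis=0)  # average topic proportion across reviews
dominant = np.bincount(doc_topic_dist.argmax(axis=1), minlength=8)  # reviews per dominant topic

weights = pd.DataFrame({
    "topic": range(8),
    "avg_proportion": topic_share,
    "n_dominant_reviews": dominant,
}).sort_values("avg_proportion", ascending=False)

print(weights)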

We've modeled the customer satisfaction factors for a dog food product. But why is this beneficial for companies? Dynamic topic models offer a straightforward way of monitoring customer satisfaction over time. They indicate product-related problems and help take the right measures. Once business decisions are put into action, topic models check whether they have an effect over time.

To do so, let’s model topic activity over time at a quarterly frequency.

import plotly.graph_objects as go

# convert date column to datetime
data['time'] = pd.to_datetime(data['time'])

# format date column to quarterly periods
data['date_quarterly'] = data['time'].dt.to_period('Q').astype(str)

periods = data['date_quarterly'].tolist()

# calculate topic activity over time
act = model.topic_activity_over_time(periods)

# visualize topic activity
fig = model.visualize_topic_activity(top_n=8, topic_activity=act, time_slices=periods)

# update legend to display only the topic number
fig.data = sorted(fig.data, key=lambda trace: trace.name)

for trace in fig.data:
    trace.name = trace.name[0]

# update the layout
fig.update_layout(
    width=1200,
    height=600,
    title='',
    legend_title_text='Topic',
    xaxis_tickangle=45         # set x-axis labels to 45-degree angle
)

# show the figure
fig.show()

The delivery problems in topic 7 peaked in Q3 2018. Customers complained about open and ripped bags and boxes much more often, but these problems were fixed in early 2019 (see the picture below).

Image 4: Topic activity over time, FASTopic. Image by author.

4.2. BERTopic

BERTopic implements bigrams with vectorizer_model, which also works as a data processing pipeline. The code and the requirements are here.

from bertopic import BERTopic
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re

nltk.download('stopwords')
nltk.download('punkt')    # tokenizer models needed by word_tokenize
nltk.download('wordnet')  # lexicon needed by WordNetLemmatizer

# create a list of customer reviews
docs = data['text'].tolist()

We train on the raw data and let the vectorizer do the cleaning. During training, the vectorizer removes numbers and stopwords, returning lemmatized tokens for the bigram model.

# create stopwords list
standard_stopwords = list(stopwords.words('english'))

# extended list of English stopwords
stopwords_extended = [ "0o",  ..]      

# additional tokens to remove
additional_stopwords = ['blue','buffalo','dog','food','ha','month','ago'] 

# combine standard, extended stopwords, and additional tokens
full_stopwords = (standard_stopwords
                  + additional_stopwords
                  + stopwords_extended)

# define tokenizer returning lemmatized text without numbers
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        doc = re.sub(r'\d+', '', doc)  # remove numbers
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)] # lemmatize

# vectorizer makes data processing and generates bigrams
vectorizer_model = CountVectorizer(tokenizer=LemmaTokenizer(),
                                  ngram_range=(2, 2),
                                  stop_words=full_stopwords)

# set-up model
model = BERTopic(n_gram_range=(2,2), # returns bigrams
                nr_topics=9,         # generate 9 topics, leave -1 for outliers
                top_n_words=20,      # return top 20 bigrams
                min_topic_size=20,   # topics contains at least 20 tokens
                vectorizer_model=vectorizer_model,
                umap_model = UMAP(random_state=1))  # set a seed so topics are reproducible

# fit model to data
topics, probabilities = model.fit_transform(docs)

Next, let’s prepare a dataframe with tokens from the model.

import pandas as pd

# retrieve bigrams for each topic and select only the word columns
topic_0 = pd.DataFrame(model.get_topic(0), columns=["Topic_0_word", "Topic_0_prob"])[["Topic_0_word"]]
topic_1 = pd.DataFrame(model.get_topic(1), columns=["Topic_1_word", "Topic_1_prob"])[["Topic_1_word"]]
topic_2 = pd.DataFrame(model.get_topic(2), columns=["Topic_2_word", "Topic_2_prob"])[["Topic_2_word"]]
topic_3 = pd.DataFrame(model.get_topic(3), columns=["Topic_3_word", "Topic_3_prob"])[["Topic_3_word"]]
topic_4 = pd.DataFrame(model.get_topic(4), columns=["Topic_4_word", "Topic_4_prob"])[["Topic_4_word"]]
topic_5 = pd.DataFrame(model.get_topic(5), columns=["Topic_5_word", "Topic_5_prob"])[["Topic_5_word"]]
topic_6 = pd.DataFrame(model.get_topic(6), columns=["Topic_6_word", "Topic_6_prob"])[["Topic_6_word"]]
topic_7 = pd.DataFrame(model.get_topic(7), columns=["Topic_7_word", "Topic_7_prob"])[["Topic_7_word"]]

# concatenate the dataframes
topics_df = pd.concat([topic_0, topic_1, topic_2, topic_3, topic_4, 
                       topic_5, topic_6,topic_7], axis=1)

The annotated topics show similar categorization to FASTopic. The differences are categorizing Spanish tokens into a separate topic (T7) and filling T1 and T5 with adjectives of positive meaning. Delivery problems in T4 are identical to FASTopic’s classification.

Image 5: Satisfaction factors modeling with BERTopic. Image by author

Again, let’s focus on topic activity over time, which gives dynamic topic models additional value for BI. BERTopic uses token frequencies (not topic weights, as FASTopic does) for topic activity analysis.

# topic activity over time
import plotly.graph_objects as go

# create timestamps
data['time'] = pd.to_datetime(data['time'])
timestamps = data['time'].to_list()

# generate topics over time, 20 bins correspond to Q frequency
topics_over_time = model.topics_over_time(docs, timestamps, nr_bins=20)

# filter out topic -1 containing outliers
topics_over_time_filtered = topics_over_time[topics_over_time['Topic'] != -1]

# visualize the filtered topics over time
fig = model.visualize_topics_over_time(topics_over_time_filtered)

# update legend to display only the topic number
fig.data = sorted(fig.data, key=lambda trace: trace.name)

for trace in fig.data:
    trace.name = trace.name[0]

# update the layout
fig.update_layout(
    width=1200,
    height=600,
    title='',
    legend_title_text='Topic',
    xaxis_tickangle=45           # set x-axis labels to 45-degree angle
)

# show the figure
fig.show()

Most topics are stable over time, except T4, which categorizes delivery problems. As with FASTopic, BERTopic shows that customers’ negative complaints about damaged boxes rose in mid-2018.

Image 6: Topic activity over time, BERTopic. Image by author.

Summary

Both models indicated delivery problems in mid-2018, which vanished in early 2019. With a topic model API monitoring customer comments on various channels, these problems can be fixed before they have a harmful effect on the brand.

The right data processing is essential for topic models to make sense in the applied world. Cleaning text in the right order minimizes the bias of each cleaning operation. Numbers and emoticons are typically removed first, followed by stopwords. Punctuation is cleaned afterward so that stopwords don't break up into two tokens ("we've" -> "we" + "'ve"). Additional tokens are removed in the next step on the clean data before lemmatization, which unifies tokens with the same semantics.
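As a rough illustration of that ordering (the function and its arguments are placeholders for this sketch, not the pipeline used above):

import re

def clean_review(text, stop_words, extra_tokens, lemmatize):
    text = re.sub(r'\d+', '', text)                                    # 1. numbers (emojis handled similarly)
    tokens = [t for t in text.lower().split() if t not in stop_words]  # 2. stopwords (contractions still intact)
    tokens = [re.sub(r'[^\w\s]', '', t) for t in tokens]               # 3. punctuation
    tokens = [t for t in tokens if t and t not in extra_tokens]        # 4. additional tokens
    return [lemmatize(t) for t in tokens]                              # 5. lemmatization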

FASTopic deserves much better documentation, which currently provides only basic information, especially because its (1) simplicity of use and (2) stability in training on small datasets make it a top-notch alternative to BERTopic. It is mainly practical for small companies like e-shops that typically don’t collect large text datasets and seek simple and efficient solutions. Data and the full code for this tutorial are here.

If you enjoy my work, you can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!

References

[1] Grootendorst, M. (2022). BERTopic: Neural Topic Modeling With A Class-Based TF-IDF Procedure. arXiv preprint arXiv:2203.05794.

[2] Wu, X., Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint arXiv:2405.17978.

The post Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code appeared first on Towards Data Science.

]]>
How to Evaluate LLM Summarization https://towardsdatascience.com/how-to-evaluate-llm-summarization-18a040c3905d/ Wed, 22 Jan 2025 17:37:07 +0000 https://towardsdatascience.com/how-to-evaluate-llm-summarization-18a040c3905d/ A practical and effective guide for evaluating AI summaries

The post How to Evaluate LLM Summarization appeared first on Towards Data Science.

]]>
Image from Unsplash

Summarization is one of the most practical and convenient tasks enabled by LLMs. However, compared to other LLM tasks like question-answering or classification, evaluating LLMs on summarization is far more challenging.

And so I myself have neglected evals for summarization, even though two apps I’ve built rely heavily on summarization (Podsmart summarizes podcasts, while aiRead creates personalized PDF summaries based on your highlights).

But recently, I’ve been persuaded – thanks to insightful posts from thought leaders in the AI industry – of the critical role of evals in systematically assessing and improving LLM systems. ([link](https://applied-llms.org/) and link). This motivated me to start investigating evals for summaries.

So in this article, I will talk about an easy-to-implement, research-backed and quantitative framework to evaluate summaries, which improves on the Summarization metric in the DeepEval framework created by Confident AI.

I will illustrate my process with an example notebook (code in Github), attempting to evaluate a ~500-word summary of a ~2500-word article Securing the AGI Laurel: Export Controls, the Compute Gap, and China’s Counterstrategy (found here, published in December 2024).

Table of Contents

  • Why it’s difficult to evaluate summarization
  • What makes a good summary
  • Introduction to DeepEval
  • DeepEval’s Summarization Metric
  • Improving the Summarization Metric
  • Conciseness Metrics
  • Coherence Metric
  • Putting it all together
  • Future Work

Why it’s difficult to evaluate summarization

Before I start, let me elaborate on why I claim that summarization is a difficult task to evaluate.

Firstly, the output of a summary is inherently open-ended (as opposed to tasks like classification or entity extraction). So, what makes a summary good depends on qualitative metrics such as fluency, coherence and consistency, which are not straightforward to measure quantitatively. Furthermore, these metrics are often subjective – for example, relevance depends on the context and audience.

Secondly, it is difficult to create gold-labelled datasets to evaluate your system’s summaries against. For RAG, it is straightforward to create a dataset of synthetic question-answer pairs to evaluate the retriever (see this nice walkthrough).

For summarization, there isn’t an obvious way to generate reference summaries automatically, so we have to turn to humans to create them. While researchers have curated summarization datasets, these would not be customized to your use case.

Thirdly, I find that most summarization metrics in the academic literature are not suitable for practical-oriented AI developers to implement. Some papers trained neural summarization metrics (e.g. Seahorse, Summac etc.), which are several GBs big and challenging to run at scale (perhaps I’m just lazy and should learn how to run HuggingFace models locally and on a GPU cluster, but still it’s a barrier to entry for most). Other traditional metrics such as BLEU and ROUGE rely on exact word/phrase overlap and were created in the pre-LLM era for extractive summarization, and may not work well for evaluating abstractive summaries generated by LLMs, which could paraphrase the source text.

Nevertheless, in my experience, humans can easily distinguish a good summary from a bad one. One common failure mode is being vague and roundabout-y (e.g. ‘this summary describes the reasons for…’).

What makes a good summary

So what is a good summary? Eugene Yan’s article offers good detail on various summary metrics. For me, I will distil them into 4 key qualities:

  1. Relevant – the summary retains important points and details from the source text
  2. Concise – the summary is information-dense, does not repeat the same point multiple times, and is not unnecessarily verbose
  3. Coherent – the summary is well-structured and easy to follow, not just a jumble of condensed facts
  4. Faithful – the summary does not hallucinate information that is not supported by the source text

One key insight is that you can actually formulate the first two as a precision and recall problem – how many facts from the source text are retained in the summary (recall), and how many facts from the summary are supported by the main text (precision).

This formulation brings us back to more familiar territory of classification problems in ML, and suggests a quantitative way to evaluate summaries.
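To make the framing concrete, here is a toy calculation (the numbers are made up purely for illustration):

# Toy numbers, purely to illustrate the precision/recall framing
facts_in_source = 20       # atomic facts identified in the source text
claims_in_summary = 8      # claims made by the summary
facts_retained = 6         # source facts that the summary covers
claims_supported = 7       # summary claims supported by the source

recall = facts_retained / facts_in_source          # coverage: 0.30
precision = claims_supported / claims_in_summary   # faithfulness: 0.875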

Some differences here are: firstly, a higher recall is better, holding summary length constant. You don’t want to score 100% recall with a summary the same length as the source. Secondly, you’d ideally want precision to be as close to 100% as possible – hallucinating information is really bad. I’ll come back to these later.

Introduction to DeepEval

You’d be spoilt for choice with all the different LLM eval frameworks out there – from Braintrust to Langfuse and more. However, today I’ll be using DeepEval, a very user-friendly framework to get started quickly, both in general, as well as specifically with summarization.

DeepEval has easy out-of-the-box implementations of many key RAG metrics, and it has a flexible Chain-of-Thought-based LLM-as-a-judge tool called GEval for you to define any custom criteria you want (I’ll use this later).

Additionally, it has helpful infrastructure to organize and speed up evals: they’ve nicely parallelized everything with async and so you can run evals on your entire dataset rapidly. They have handy features for synthetic data generation (will cover in later articles), and they allow you to define custom metrics to adapt their metrics (exactly what we’re going to do today), or to define non-LLM-based eval metrics for more cost-effective & robust evals (e.g. entity density, later).

DeepEval’s Summarization Metric

DeepEval’s summarization metric (read more about it here) is a reference-free metric (i.e. no need for gold-standard summaries), and just requires the source text (passed as the input field) and the generated summary to be evaluated (the actual_output field). As you can see, the set-up and evaluation code below is really simple!

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

# Create a DeepEval test case for the purposes of the evaluation
test_case = LLMTestCase(
  input = text,
  actual_output = summary
)

# Instantiate the summarization metric
summarization_metric = SummarizationMetric(verbose_mode = True, n = 20, truths_extraction_limit = 20)

# Run the evaluation on the test case
eval_result = evaluate([test_case], [summarization_metric])

The summarization metric actually evaluates two separate components under-the-hood: alignment and coverage. These correspond closely to the precision and recall formulation I introduced earlier!

For alignment, the evaluator LLM generates a list of claims from the summary, and for each claim, the LLM will determine how many of these claims are supported by truths which are extracted from the source text, producing the alignment score.

In the case of coverage, the LLM generates a list of assessment questions from the source text, then tries to answer the questions, using only the summary as context. The LLM is prompted to respond ‘idk’ if the answer cannot be found. Then, the LLM will determine how many of these answers are correct, to get the coverage score.

The final summarization score is the minimum of the alignment and coverage scores.

Improving the Summarization Metric

However, while what DeepEval has done is a great starting point, there are three key issues that hinder the reliability and usefulness of the Summarization metric in its current form.

So I have built a custom summarization metric which adapts DeepEval’s version. Below, I’ll explain each problem and the corresponding solution I’ve implemented to overcome it:

1: Using yes/no questions for the coverage metric is too simplistic

Currently, the assessment questions are constrained to be yes/no questions, in which the answer to the question is yes – have a look at the questions:

Image by author

There are two problems with this:

Firstly, by framing the questions as binary yes/no, this limits their informativeness, especially in determining nuanced qualitative points.

Secondly, if the LLM answering from the summary hallucinates a ‘yes’ answer (with only 3 possible answers – ‘yes’, ‘no’, ‘idk’ – it’s not unlikely it’ll hallucinate yes), the evaluator will erroneously deem this answer to be correct. It is much more difficult to hallucinate the correct answer to an open-ended question. Furthermore, if you look at the questions, they are phrased in a contrived way almost hinting that the answer is ‘yes’ (e.g. "Does China employ informational opacity as a strategy?"), hence increasing the likelihood of a hallucinated ‘yes’.

My solution was to ask the LLM to generate open-ended questions from the source text – in the code, these are referred to as ‘complex questions’.

Additionally, I ask the LLM to assign an importance score to each question (so we can perhaps upweight more important questions in the coverage score).

Since the questions are now open-ended, I use an LLM for evaluation – I ask the LLM to give a 0–5 score of how similar the answer generated from the summary is to the answer generated with the source text (the reference answer), as well as an explanation.

def generate_complex_verdicts(answers):
    return f"""You are given a list of JSON objects. Each contains 'original_answer' and 'summary_answer'.
    Original answer is the correct answer to a question. 
    Your job is to assess if the summary answer is correct, based on the model answer which is the original answer.
    Give a score from 0 to 5, with 0 being completely wrong, and 5 being completely correct.
    If the 'summary_answer' is 'idk', return a score of 0.

    Return a JSON object with the key 'verdicts', which is a list of JSON objects, with the keys: 'score', and 'reason': a concise 1 sentence explanation for the score.
..."""

def generate_complex_questions(text, n):
        return f"""Based on the given text, generate a list of {n} questions that can be answered with the information in this document.
        The questions should be related to the main points of this document. 
        Then, provide a concise 1 sentence answer to the question, using only information that can be found in the document.
        Answer concisely, your answer does not need to be in full sentences.
        Make sure the questions are different from each other. 
        They should cover a combination of questions on cause, impact, policy, advantages/disadvantages, etc.

        Lastly, rate the importance of this question to the document on a scale of 1 to 5, with 1 being not important and 5 being most important. 
        Important question means the question is related to an essential or main point of the document,
        and that not knowing the answer to this question would mean that the reader has not understood the document's main point at all.
        A less important question is one asking about a smaller detail, that is not essential to understanding the document's main point.

 ..."""

2: Extracting truths from source text for alignment is flawed

Currently, for the alignment metric, a list of truths is extracted from the source text using an LLM (a parameter truths_extraction_limit which can be controlled). This leads to some facts/details from the source text being omitted from the truths, which the summary’s claims are then compared against.

To be honest, I’m not sure what the team was thinking when they implemented it like this – perhaps I had missed a nuance or misunderstood their intention.

However, this leads to two problems that render the alignment score ‘unusable’, according to a user on Github.

Firstly, the LLM-generated list of truths is non-deterministic, hence people have reported wildly changing alignment scores. This inconsistency likely stems from the LLM choosing different subsets of truths each time. More critically, the truth extraction process makes this an unfair judge of the summary’s faithfulness, because a detail from the summary could be found in the source text but not in the extracted truths. Anecdotally, all the claims that were detected as unfaithful were indeed in the main text but not in the extracted truths. Additionally, people have reported that when you pass in the summary as equal to the input, the alignment score is less than 1, which is strange.

To address this, I just made a simple adjustment – which was to pass the entire source text into the LLM evaluating the summary claims, instead of the list of truths. Since all the claims are evaluated together in one LLM call, this won’t significantly raise token costs.

3: Final score being min(alignment score, coverage score) is flawed

Currently, the score that is output is the minimum of the alignment and coverage scores (and there’s actually no way of accessing the individual scores without placing it in the logs).

This is problematic, because the coverage score will likely be lower than the alignment score (if not, then there are real problems!). This means that changes in the alignment score do not affect the final score. However, that doesn’t mean that we can ignore deteriorations in the alignment score (say from 1 to 0.8), which arguably signal a more severe problem with the summary (i.e. hallucinating a claim).

My solution was to change the final score to the F1 score, just like in ML classification, to capture the importance of both precision and recall. An extension is to change the weighting of precision and recall (e.g. upweight precision if you think that hallucination is something to avoid at all costs – see here).
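Here is a minimal sketch of that combination (my own helper, not part of DeepEval): with beta = 1 it is the standard F1, beta < 1 upweights alignment (precision), and beta > 1 upweights coverage (recall).

def summarization_f_score(alignment_score, coverage_score, beta=1.0):
    # F-beta combination of the alignment (precision) and coverage (recall) scores
    if alignment_score == 0 or coverage_score == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * alignment_score * coverage_score / (b2 * alignment_score + coverage_score)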

With these 3 changes, the summarization metric now better reflects the relevance and faithfulness of the generated summaries.

Conciseness Metrics

However, this still gives an incomplete picture. A summary should also be concise and information-dense, condensing key information into a shorter version.

Entity density is a useful and cheap metric to look at. The Chain-of-Density paper shows that human-created summaries, as well as human-preferred AI-generated summaries, have an entity density of ~0.15 entities/tokens, striking the right balance between clarity (favoring less dense) and informativeness (favoring more dense).

Hence, we can create a Density Score which penalizes summaries with Entity Density further away from 0.15 (either too dense or not dense enough). Initial AI-generated summaries are typically less dense (0.10 or less), and the Chain-of-Density paper shows an iterative process to increase the density of summaries. Ivan Leo & Jason Liu wrote a good article on fine-tuning Chain-of-Density summaries using entity density as the key metric.

import nltk
import spacy

nltk.download('punkt')  # tokenizer models needed by word_tokenize
nlp = spacy.load("en_core_web_sm")

def get_entity_density(text):
  summary_tokens = nltk.word_tokenize(text)
  num_tokens = len(summary_tokens)
  # Extract entities with spaCy's named entity recognizer
  doc = nlp(text)
  num_entities = len(doc.ents)
  entity_density = num_entities / num_tokens
  return entity_density
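A possible Density Score on top of this (my own formulation, not from the Chain-of-Density paper) scores 1.0 at the ~0.15 target and decays linearly as the density drifts away in either direction:

def get_density_score(text, target=0.15, tolerance=0.15):
  # 1.0 at the target density, 0.0 once the density is `tolerance` away from it
  density = get_entity_density(text)
  return max(0.0, 1 - abs(density - target) / tolerance)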

Next, I use a Sentence Vagueness metric to explicitly penalize vague sentences (e.g. ‘this summary describes the reasons for…’) that don’t actually state the key information.

For this, I break up the summary into sentences (similar to the alignment metric) and ask an LLM to classify if each sentence is vague or not, with the final score being the proportion of sentences classified as vague.

from typing import List

from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel

# note: `llm` is assumed to be a chat model instantiated earlier (e.g. ChatOpenAI)
prompt = ChatPromptTemplate.from_template(
    """You are given a list of sentences from a summary of a text.
    For each sentence, your job is to evaluate if the sentence is vague, and hence does not help in summarizing the key points of the text.

    Vague sentences are those that do not directly mention a main point, e.g. 'this summary describes the reasons for China's AI policy'. 
    Such a sentence does not mention the specific reasons, and is vague and uninformative.
    Sentences that use phrases such as 'the article suggests', 'the author describes', 'the text discusses' are also considered vague and verbose.
  ...
    OUTPUT:"""
)

class SentenceVagueness(BaseModel):
    sentence_id: int
    is_vague: bool
    reason: str

class SentencesVagueness(BaseModel):
    sentences: List[SentenceVagueness]

chain = prompt | llm.with_structured_output(SentencesVagueness)
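To turn the classifications into a score, something like the sketch below can be used. Note that the sentences input variable passed to the chain is an assumption about the elided part of the prompt template; adapt it to whatever placeholder your template actually uses.

import nltk

# assumes `summary` holds the summary text and the prompt exposes a {sentences} variable
summary_sentences = nltk.sent_tokenize(summary)
numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(summary_sentences))

result = chain.invoke({"sentences": numbered})
vagueness_score = sum(s.is_vague for s in result.sentences) / len(result.sentences)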

Lastly, a summary that repeats the same information is inefficient, as it wastes valuable space that could have been used to convey new meaningful insights.

Hence, we construct a Repetitiveness score using GEval. As I briefly mentioned above, GEval uses LLM-as-a-judge with chain-of-thoughts to evaluate any custom criteria. As detecting repeated concepts is a more complex problem, we need a more intelligent detector aka an LLM. (Warning: the results for this metric seemed quite unstable – the LLM would change its answer when I ran it repeatedly on the same input. Perhaps try some prompt engineering)

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

repetitiveness_metric = GEval(
    name="Repetitiveness",
    criteria="""I do not want my summary to contain unnecessary repetitive information.
    Return 1 if the summary does not contain unnecessarily repetitive information, and 0 if the summary contains unnecessary repetitive information.
    facts or main points that are repeated more than once. Points on the same topic, but talking about different aspects, are OK. In your reasoning, point out any unnecessarily repetitive points.""",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    verbose_mode = True
)

Coherence Metric

Finally, we want to ensure that LLM outputs are coherent – having a logical flow with related points together and making smooth transitions. Meta’s recent Large Concept Models paper used a metric for local coherence from Parola et al. (2023): the average cosine similarity between each nth and (n+2)th sentence. A simple metric that is easily implemented. We find that the LLM summary has a score of ~0.45. As a sense check, if we randomly permute the sentences of the summary, the coherence score drops below 0.4.

import numpy as np
from scipy.spatial.distance import cosine
from langchain_openai import OpenAIEmbeddings

# Calculate cosine similarity between each nth and (n+2)th sentence
def compute_coherence_score(sentences):
  embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
  sentences_embeddings = embedding_model.embed_documents(sentences)
  sentence_similarities = []
  for i in range(len(sentences_embeddings) - 2):
    # Convert embeddings to numpy arrays and reshape to 2D
    emb1 = np.array(sentences_embeddings[i])
    emb2 = np.array(sentences_embeddings[i+2])
    # Calculate cosine distance
    distance = cosine(emb1, emb2)
    similarity = 1 - distance
    sentence_similarities.append(similarity)
  coherence_score = np.mean(sentence_similarities)
  return coherence_score
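The sense check mentioned above can be run like this (assuming summary holds the summary text):

import random
import nltk

summary_sentences = nltk.sent_tokenize(summary)
print("original :", compute_coherence_score(summary_sentences))

# randomly permute the sentences - coherence should drop
shuffled = summary_sentences.copy()
random.shuffle(shuffled)
print("shuffled :", compute_coherence_score(shuffled))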

Putting it all together

We can package each of the above metrics into Custom Metrics. The benefit is that we can evaluate all of them in parallel on your dataset of summaries and get all your results in one place! (see the code notebook)
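As a rough sketch of what that packaging can look like, here is the entity-density score wrapped as a DeepEval custom metric. Treat the exact BaseMetric interface shown here as an assumption and check DeepEval's documentation for the current signature.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class EntityDensityMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # score the generated summary only - no LLM call needed
        self.score = get_density_score(test_case.actual_output)
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Entity Density"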

One caveat, though, is that for some of these metrics, like Coherence or Recall, there isn’t a sense of what the ‘optimal’ value is for a summary, and we can only compare scores across different AI-generated summaries to determine better or worse.

Future Work

What I’ve introduced in this article provides a solid starting point for evaluating your summaries!

It’s not perfect though, and there are areas for future exploration and improvement.

One area is to better test whether the summaries capture important points from the source text. You don’t want a summary that has a high recall, but of unimportant details.

Currently, when we generate assessment questions, we ask the LLM to rate their importance. However, it’s hard to take those importance ratings as the ground truth either – if you think about it, when LLMs summarize they essentially rate the importance of different facts too. Hence, we need a measure of importance outside the LLM. Of course, the ideal is to have human reference summaries, but these are expensive and not scalable. Another source of reference summaries would be reports with executive summaries (e.g. finance pitches, conclusions from slide decks, abstracts from papers). We could also use techniques like the PageRank of embeddings to identify the central concepts algorithmically, as sketched below.
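A rough sketch of that PageRank idea (my own illustration, reusing the embedding model from earlier): embed the source sentences, build a similarity graph, and treat the highest-PageRank sentences as the points a summary should cover.

import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
from langchain_openai import OpenAIEmbeddings

def central_sentences(sentences, top_k=5):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small").embed_documents(sentences)
    sim = cosine_similarity(np.array(embeddings))
    np.fill_diagonal(sim, 0.0)        # ignore self-similarity
    graph = nx.from_numpy_array(sim)  # weighted, undirected similarity graph
    scores = nx.pagerank(graph, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [sentences[i] for i in ranked[:top_k]]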

An interesting idea to try is generating synthetic source articles – start with a set of main points (representing ground-truth "important" points) on a given topic, and then ask the LLM to lengthen them into a full article (run this multiple times with high temperature to generate many diverse synthetic articles!). Then run the full articles through the summarization process, and evaluate the summaries on retaining the original main points.

Last but not least, it is very important to ensure that each of the summarization metrics I’ve introduced correlates with human evaluations of summary preference. While researchers have done so for some metrics on large summarization datasets, these findings might not generalize to your texts and/or audience (perhaps your company prefers a specific style of summaries, e.g. with many statistics).

For an excellent discussion on this topic, see ‘Level 2’ of Hamel Husain’s article on evals. For example, if you find that LLM’s Sentence Vagueness scores don’t correlate well with what you consider to be vague sentences, then some prompt engineering (providing examples of vague sentences, elaborating more) can hopefully bring the correlation up.

Although this step can be time-consuming, it is essential, in order to ensure you can trust the LLM evals. This will save you time in the long run anyway – when your LLM evals are aligned, you essentially gain an infinitely-scalable evaluator customised to your needs and preferences.

You can speed up your human evaluation process by creating an easy-to-use Gradio annotation interface – I one-shotted a decent interface using OpenAI o1!

In a future article, I will talk about how to actually use these insights to improve my summarization process. Two years ago I wrote about how to summarize long texts, but both LLM advances and 2 years of experience have led to my summarization methods changing dramatically.

Thanks so much for reading! In case you missed it, all the code can be found in the GitHub repo here. Follow me on X/Twitter for more posts on AI!

What metrics do you use to evaluate LLM summarization? Let me know in the comments!

The post How to Evaluate LLM Summarization appeared first on Towards Data Science.

]]>
Data-Driven Decision Making with Sentiment Analysis in R https://towardsdatascience.com/data-driven-decision-making-with-sentiment-analysis-in-r-3d4a3b19a0db/ Tue, 21 Jan 2025 19:06:30 +0000 https://towardsdatascience.com/data-driven-decision-making-with-sentiment-analysis-in-r-3d4a3b19a0db/ Leveraging the Quanteda, Textstem and Sentimentr Packages to Extract Customer Insights and Enhance Business Strategy

The post Data-Driven Decision Making with Sentiment Analysis in R appeared first on Towards Data Science.

]]>
Leveraging the Quanteda, Textstem and Sentimentr Packages to Extract Customer Insights and Enhance Business Strategy
Image by Ralf Ruppert from Pixabay

Should Businesses Really Hear Their Customers’ Voices?

In a rapidly evolving world that is getting more and more AI-driven every instant, businesses now need to constantly seek a competitive edge to remain sustainable. Companies may do this by regularly observing and analyzing customer opinions regarding their products and services. They achieve this by assessing comments from many sources, both online and offline. Identifying positive and negative trends in customer feedback allows them to fine-tune product features and design marketing strategies that meet the needs of customers.

Thus, customer opinions need to be discerned appropriately to find valuable insights that can help make informed business decisions.

Familiarizing Yourself with Sentiment Analysis

Sentiment analysis, a part of natural language processing (NLP), is a popular technique today because it studies people’s opinions, sentiments, and emotions in any given text. Collected customer feedback contains valuable information, but its unstructured nature can make it difficult to analyze. By applying sentiment analysis to this feedback, businesses can understand public opinion, monitor brand reputation, and improve customer experiences. By regularly analyzing customer sentiments, companies can identify their strengths and weaknesses, decide how to boost product development, and build better marketing strategies.

Powerful packages for sentiment analysis in both Python and R enable businesses to uncover valuable patterns, track sentiment trends, and make data-driven decisions. In this article, we will explore how to use different packages (Quanteda, Sentimentr and Textstem) to perform sentiment analysis on customer feedback by processing, analyzing, and visualizing textual data.

Adding a Real-world Context

For this tutorial, let us consider a fictional tech company, PhoneTech, that has recently launched a new smartphone in the budget segment for its young audience. Now, they want to know the public perception of their newly launched product and, hence, want to analyze the customer feedback from social media, online reviews, and customer surveys.

To achieve this, PhoneTech needs to use Sentiment Analysis to find product strengths and weaknesses, guide product development, and adjust marketing strategies. For example, PhoneTech has collected feedback from various platforms like social media (e.g., informal comments like "The camera is 🔥 but battery life 😒 . #Disappointed"), online reviews (e.g., semi-structured comments such as "Amazing build quality! ⭐⭐⭐⭐ Battery could last longer, though"), and customer surveys (e.g., structured responses to questions like "What do you like/dislike about the product?").

It’s important to note that customer feedback often includes informal language, emojis, and specific terms. We can use R packages to clean, tokenize, and analyze this data in order to turn raw text into actionable business insights.

Implementing Sentiment Analysis

Next, we’ll build a model for sentiment analysis in R using the chosen quanteda package.

1. Importing necessary packages and dataset

For evaluating sentiments in a given dataset, we need several packages, including dplyr to manipulate the data of customer feedback entries, quanteda (License: GPL-3.0 license) for text analysis, and quanteda.textplots to create a word cloud. Additionally, tidytext (License: MIT + file LICENSE) will be used for sentiment lexicons and scoring, ggplot2 for data visualization, textstem (License: GPL-2) for text stemming and lemmatization, sentimentr (License: MIT + file LICENSE) for sentiment analysis, and RColorBrewer for the color palettes in our visualizations.

These can be easily installed with the following command-

install.packages(c("dplyr", "ggplot2", "quanteda", "quanteda.textplots", 
                   "tidytext", "textstem", "sentimentr", "colorbrewer"))

After installation, we can load the packages as:

# Load necessary R packages
library(dplyr)
library(ggplot2)
library(quanteda)
library(quanteda.textplots)
library(tidytext)
library(textstem)
library(sentimentr)
library(RColorBrewer)

Dataset for customer reviews

In the case of the real-world dataset, this data would actually be scraped using multiple tools from various social media platforms. The collected data would represent the feedback that includes informal language, emojis, and domain-specific terms. Such a combined dataset can allow for a detailed analysis of customer sentiments and opinions across different sources.

However, for this tutorial, let us use a synthetic dataset generated in R using packages that cover these above points. The dataset with 200 rows represents customer feedback (~2–3 sentences in each row) from different sources and includes raw text with emojis and symbols, abbreviations, etc., mimicking real-world scenarios. These sentences are simply a generic representation of the reviews commonly seen on e-commerce or product websites (talk about keywords such as UI, design, phone features and price, experienced battery life, customer service support, etc.) and are combined in random patterns with emojis for creating a review text.

You can find the synthetic dataset generated using R on GitHub here.

#load the dataset
data <- read.csv("sentiment_data.csv")
# Print the dimensions (number of rows and columns) of the dataset
dataset_size <- dim(data)
print(dataset_size)

Since the dataset has a lot of text, let’s print a few words per row for the dataset overview.

To achieve this, we’ll first define a function to extract the first few words from each feedback entry in our dataset. We’ll then randomly sample 5 rows from the dataset and apply the function to truncate the feedback text. Finally, we’ll print the resulting data frame to get an idea of the feedback text.

# Function to extract the first few words
extract_first_words <- function(text, num_words = 10) {
  if (is.na(text) || !is.character(text)) {
    return(NA)
  }
  words <- unlist(strsplit(text, "\\s+"))
  return(paste(words[1:min(num_words, length(words))], collapse = " "))
}
# Randomly sample 5 rows from the dataset
set.seed(123)
random_feedback <- data[sample(nrow(data), size = 5, replace = FALSE), ]
# Truncate each sampled feedback entry to its first few words
random_feedback$text <- sapply(random_feedback$text, function(text) {
truncated <- extract_first_words(text)
paste0(truncated, "...")
})
# Print the data frame
print(random_feedback)

2. Preprocessing Text Data

Before moving to text analysis, we need to preprocess the text to ensure a clean and consistent format. Preprocessing will involve several key steps:

  1. Text Cleaning which includes removal of punctuation, numbers, and special characters;
  2. Text Normalizing which includes conversion of the alphabets to lowercase;
  3. Tokenizing the text which includes splitting the text into individual words or tokens;
  4. Removing stop words which includes intentional removal of words that do not contribute to sentiment analysis (e.g., "the," "and"); and finally,
  5. Stemming or lemmatizing the text where the words are reduced to their root forms. These steps help lessen the noise and improve the accuracy of the analysis.

Now, we’ll implement the above preprocessing steps on our dataset.

# Cleaning the dataset
corpus <- quanteda::corpus(data$text)
tokens_clean <- quanteda::tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("en"))
# Convert tokens to character vectors for lemmatization
tokens_char <- sapply(tokens_clean, function(tokens) paste(tokens, collapse = " "))
# Lemmatize tokens using WordNet
lemmatized_texts <- lemmatize_strings(tokens_char)

In this code, we convert the dataset's text column into a quanteda corpus object. We clean the text by tokenizing it, which involves removing punctuation, numbers, and symbols, converting all words to lowercase, and filtering out common stopwords. Note that we have deliberately not applied stemming, which reduces words to their root forms by applying simple rules that chop off word endings (e.g., changing "running" to "run"). Because of this simplification, stemming can produce partial or incomplete forms: the algorithm might remove the suffix "-ing" from "amazing," resulting in "amaz," or turn "terrible" into "terribl". To get more accurate root forms, we instead use lemmatization, a more sophisticated process that relies on dictionaries to map words and considers the context and part of speech of each word to return its base or dictionary form.

Now that we have cleaned and tokenized the text data, we can move on to the next step. Our goal is to analyze the sentiments in the feedback entries. We will use the sentimentr package to evaluate the sentiments in our structured data, providing insights into the emotional tone of the feedback entries.

3. Performing Sentiment Analysis Using Sentimentr package

Now, we can perform sentiment analysis on these sentences with the sentiment function from the sentimentr package. This function calculates sentiment scores for each piece of text, identifying positive and negative words.

Next, we summarize the sentiment scores for each document. We group the scores by document and calculate the total number of positive and negative words. We also calculate a compound score and categorize the overall sentiment as either positive or negative.

# Perform sentiment analysis using sentimentr
sentiment_scores <- sentiment(lemmatized_texts)
# Summarize sentiment scores for each document
sentiment_summary <- sentiment_scores %>%
group_by(element_id) %>%
summarize(
positive_words = sum(sentiment > 0),
negative_words = sum(sentiment < 0),
compound = sum(sentiment)
) %>%
mutate(
sentiment = ifelse(compound > 0, "Positive", "Negative")
)

Finally, we merge this sentiment summary with the original text data and print the results. This gives us a clear, concise evaluation of the sentiment in our dataset.

# Merge with original text for context using row number as a common column
sentiment_summary <- sentiment_summary %>%
mutate(doc_id = as.character(element_id)) %>%
left_join(data %>% mutate(doc_id = as.character(1:nrow(data))), by = "doc_id") %>%
select(text, positive_words, negative_words, compound, sentiment)
# Print the sentiment evaluation table
print(sentiment_summary)

The output table clearly shows the positive and negative word counts per review in each row, along with the compound score and the predicted sentiment. At a glance, the model does a reasonably good job of sorting positive and negative reviews. Some reviews look clearly negative in the truncated display (e.g. "Would not recommend…."), yet it is quite likely that the full review contains more positive keywords (satisfies, best, good, etc.), which resulted in a positive sentiment according to the model. Such reviews need to be carefully reviewed separately before being included in the interpretation of the results for decision-making.

Next, we create a Document-Feature Matrix (DFM), which is a structured representation of the text where rows represent documents and columns represent features (words). Each cell contains the frequency of a specific word in a document. Here, the cleaned tokens are transformed into a DFM, making them ready for statistical analysis and visualization.

# Create a document-feature matrix (DFM)
dfm <- dfm(tokens_clean)

This section calculates sentiment metrics for each text entry. Positive and negative word counts are summed, and a compound score is computed as the difference between these counts. A positive compound score indicates positive sentiment and a negative score indicates negative sentiment. This information is combined with the original text for a comprehensive sentiment evaluation.

4. Analyzing Sentiment Proportions

# Evaluate sentiment proportions as percentages
sentiment_proportion <- sentiment_summary %>%
group_by(sentiment) %>%
summarise(count = n()) %>%
mutate(proportion = count / sum(count) * 100)
print(sentiment_proportion)

To understand the overall sentiment distribution, we calculate the proportions of positive and negative sentiments in the dataset. Grouping by sentiment type, the count of entries in each category is calculated and normalized to derive their proportions.

5. Visualizing Sentiment Distribution

We’ll create a bar chart in ggplot2 to visualize the proportions of positive and negative sentiments for an intuitive visualization of the sentiment distribution, making it easy to observe which type of sentiment seems dominant.

# Plot sentiment distribution as percentages
ggplot(sentiment_proportion, aes(x = sentiment, y = proportion, fill = sentiment)) +
geom_bar(stat = "identity", width = 0.7) +
scale_fill_manual(values = c("Positive" = "blue", "Negative" = "red")) +
labs(title = "Distribution of Sentiments",
x = "Sentiment Type",
y = "Percentage",
fill = "Sentiment") +
theme_minimal() +
theme(panel.grid = element_blank())
Image by Author

In our dataset, positive sentiment seems dominant. Hence, a larger proportion of the customers are happy with PhoneTech’s product.

6. Visualizing Top Terms

# Plotting top 10 terms
top_terms <- topfeatures(dfm, 10)
bar_colors <- colorRampPalette(c("lightblue", "blue"))(length(top_terms))
# Barplot
barplot(top_terms, main = "Top 10 Terms", las = 2, col = bar_colors, horiz = TRUE, cex.names = 0.7)
Image by Author

The 5 most frequent terms in the dataset seem to be "recommend", "design", "smartphone", "display", and "terrible". Although such words are not very useful on their own for understanding sentiment, PhoneTech personnel could dig deeper into how these words are associated with the product in the reviews and build other plots that make clear the context in which these words are used.

So, let’s filter out the positive feedback, create a DFM, and plot again to see what customers are really saying about the product.

# Filter positive feedback 
positive_feedback <- sentiment_summary %>% 
 filter(sentiment == "Positive") 
# Create a DFM for positive feedback 
positive_tokens <- quanteda::tokens(positive_feedback$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
 tokens_tolower() %>% 
 tokens_remove(stopwords("en")) 
positive_dfm <- dfm(positive_tokens)
# Plot top 5 terms with a gradient
top_positive <- topfeatures(positive_dfm, 5)
bar_colors <- colorRampPalette(c("lightblue", "blue"))(length(top_positive))
# Plot with gradient
barplot(top_positive, main = "Top 5 Positive Terms", las = 2, col = bar_colors, horiz = TRUE, cex.names = 0.7)
Image by Author

The product performance, smartphone (could likely indicate brand), display, and design seem to be the most talked about in the dataset.

Another way to visualize the sentiments in our dataset is by generating a word cloud, adjusting the number of words displayed with the max_words parameter as needed.

7. Generating a Word Cloud

# Word cloud
textplot_wordcloud(dfm, max_words = 200, color = RColorBrewer::brewer.pal(8, "Reds"), min_size = 0.5, max_size = 8)

A word cloud also displays the most frequent terms in an engaging and intuitive format when working on sentiment analysis tasks. It is important to note that larger words indicate higher frequencies, and this plot is particularly useful for quickly identifying key themes in the given dataset.

Image by Author

For the PhoneTech team, it might be worth considering two separate positive and negative word clouds to understand better what the most loved feature of the product is and what the pain point is.

8. Sampling and Reviewing Sentiments

Finally, we’ll print five random sentences from the dataset to inspect their sentiment evaluation results. This will help us validate the sentiment analysis outputs and gain insights into individual entries.

# Sample 5 sentences from the dataset
sample_indices <- sample(1:nrow(sentiment_summary), 5)
sample_sentiment_summary <- sentiment_summary[sample_indices, ]
# Print the sample sentences
print(sample_sentiment_summary)

So, all the above steps form a comprehensive pipeline for analyzing textual data as well as extracting valuable insights. Together, these steps help to transform raw text into actionable insights, supporting data-driven decision-making for any company.

Interpreting Sentiment Analysis Results

It is crucial to assess and evaluate the findings of the sentiment analysis correctly. For this, we generated a Document-Feature Matrix (DFM) to find the top words and the overall sentiment distribution, helping us understand the overall customer mood and identify patterns in the feedback. Additionally, our model generated sentiment scores to provide an idea of the tone of the reviews.

For example, if PhoneTech finds that 68% of the feedback is positive, with the top words being "design" and "performance," this highlights key selling points for marketing. Conversely, the remaining 32% of reviews, i.e., the negative comments, mention customer service and poor photo quality, indicating potential areas for improvement.

Comparing sentiment trends over time or across sources, such as social media versus online reviews, helps identify shifts in customer perception. An accurate interpretation is important for making informed decisions and developing targeted strategies.

While the model seems to effectively identify positive and negative reviews, further steps can involve fine-tuning the model to sort neutral reviews, if any, for a more comprehensive analysis.

Applying Sentiment Insights to Fine-tune Strategy

The sentiment analysis has revealed some key areas of product improvement and its strengths for PhoneTech that can be leveraged to enhance its business. By addressing both positive and negative customer feedback, PhoneTech can drive overall satisfaction and attract more buyers.

Based on sentiment analysis results, PhoneTech could identify the following actionable insights and strategies to improve its business:

Strategies based on positive feedback:

(1) Refine Marketing Strategies:

  • Customers seem to be happy with the sleek and fast UI.
  • Positive feedback on the UI design indicates that this is a key selling point, which PhoneTech should continue promoting in their marketing campaigns to attract more buyers.

Strategies based on negative feedback:

(1) Enhance Product Features:

  • Frequent complaints about image quality suggest an issue with either the hardware or software.
  • Improving these areas quickly can enhance the user experience and reduce negative reviews.

(2) Addressing Customer Service Issues:

  • Handling customer service issues and resolving them promptly will boost product satisfaction.
  • These actions can prevent or reduce negative reviews while ensuring a better user experience and increasing overall reliability.

Best Practices in Sentiment Analysis

  • Text Context: As lexicon-based sentiment analysis often misses sarcasm and context, using advanced techniques like machine learning helps better capture nuances.
  • Domain-Specific Language: As general lexicons may not understand industry-specific terms and slang, tailoring lexicons to include technical terms relevant to the industry improves accuracy.
  • Use of Informal Language and Emojis: Since informal language and emojis can be challenging to analyze, using tools like quanteda to clean and systematically analyze data is beneficial.
  • Combining Techniques: As relying on one method limits analysis depth, combining text processing with machine learning provides comprehensive insights.

Key Takeaways

  • Sentiment analysis helps businesses understand customer opinions to improve products and services.
  • The R packages quanteda, sentimentr, and textstem work well together for text analysis of customer reviews.
  • The outlined approach for sentiment analysis can be easily applied across industries like finance, healthcare, and retail for actionable insights.

Conclusion

Sentiment analysis gives businesses a clear idea about their customer needs and pain points. Companies can leverage insights to improve products and craft data-driven strategies.

In this article, we explored how R packages can help with sentiment analysis on customer feedback for a tech product. We discussed the background of the challenge and walked through the steps to implement sentiment analysis in R, including data collection and preparation, corpus creation, tokenization, feature extraction, building sentiment models, and visualizing results. We also considered the outcomes of the analysis that the company needs to take into account to further refine the product.

Companies in other domains that are looking to gain actionable insights, enhance product features, refine marketing strategies, and monitor brand reputation effectively could take a very similar approach to sentiment analysis.

The post Data-Driven Decision Making with Sentiment Analysis in R appeared first on Towards Data Science.

]]>
Understanding the Evolution of ChatGPT: Part 3- Insights from Codex and InstructGPT https://towardsdatascience.com/understanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7/ Tue, 21 Jan 2025 18:19:27 +0000 https://towardsdatascience.com/understanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7/ Mastering the art of fine-tuning: Learnings for training your own LLMs.

The post Understanding the Evolution of ChatGPT: Part 3- Insights from Codex and InstructGPT appeared first on Towards Data Science.

]]>
Understanding the Evolution of ChatGPT: Part 3— Insights from Codex and InstructGPT
(Image from Unsplash)

This is the third article in our GPT series, and also the most practical one: finally, we will talk about how to effectively fine-tune LLMs.

It is practical in the sense that, if you were asked to train your own LLM today, you could skip pre-training and jump straight into using an open-source LLM or SLM. However, you will very likely still need to finetune it on your own data and task, and that is where this article can help.

More specifically, we will focus on two finetuned models – Codex and InstructGPT, as they represent two types of challenges in LLM finetuning:

  • Codex needs to adapt a pretrained LLM to a different modality (code scripts), as programming languages have many characteristics that natural language does not;
  • InstructGPT aims to make the model more aligned with human preferences, which cannot be achieved automatically by traditional language modeling objectives.

As we will see later, both challenges demand creativity and carefulness at every stage of the finetuning process: how to collect high-quality data, how to modify model architectures, how to effectively initialize your model, how to determine a proper objective, and how to properly evaluate it.

Below is the outline for this article:

  • Overview: why we need finetuning and what makes it so challenging; GPT3.5 and its finetuned versions.
  • Codex: how to evaluate code generation properly, how to collect data and how to adapt the model to process programming languages.
  • InstructGPT and ChatGPT: how to evaluate alignment, why RLHF works, and how it is implemented in InstructGPT.
  • Summary: best practices in LLM finetuning.

Below are the links to our previous articles if you are interested:

  • Part 1: An In-Depth Look at GPT-1 and What Inspired It: where we cover the pre-training plus finetuning paradigm and its evolution from CV to NLP, previous pre-training efforts such as Word2Vec and GloVe, decoder-only Transformers, auto-regressive vs. auto-encoding LM, and key innovations of GPT-1.
  • Part 2: GPT-2 and GPT-3: where we cover how GPT models were scaled up from 117M to 175B, under the philosophy of exploring task-agnostic pre-training via scaling hypothesis and in-context learning.

Overview

As we explained in our second article, both GPT-2 and GPT-3 can be considered as OpenAI’s experiments to test the potential of task-agnostic pre-training. While doing so, the authors also mentioned finetuning as a promising direction for future studies, as it might help the model to further improve its performance on certain tasks.

Why is Finetuning Needed?

The reasons are three-fold.

The first reason is of course performance. Pre-trained models are more like generalists that can perform a wide range of tasks reasonably well, but still they might struggle to beat the specialists trained on a particular task. If our goal is to have such a specialized model to help us on a very specific task, then finetuning should be definitely considered.

Another reason is that, albeit being generally powerful, GPT-3 models are not always reliable in following human instructions, especially when those instructions become complex. This is because, as the authors explained in the InstructGPT paper, the pre-training objective focuses mainly on language modeling, i.e., predicting the next token, and such capabilities do not automatically translate to instruction-following. Thus, special finetuning strategies are needed.

There are also safety and ethical concerns, for a very similar reason: auto-regressive language modeling alone is not sufficient to keep the model from generating harmful or biased answers. Here too, finetuning gives us better control over the generation process.

Challenges in Finetuning

Broadly speaking, there are two types of challenges in finetuning LLMs: the need to adapt to a new modality, and the need to align the model with human preferences.

Codex is an example of the former case: the pre-trained model needs to be applied to a different modality with its own unique characteristics. To process code, the model must understand the basic syntax of a specific programming language, handle static and dynamic typing (and even infer types), and correctly handle indentation in languages like Python.

The latter case is trickier in some ways, as "alignment" itself is a vague and somewhat controversial concept; it has to be defined more clearly and translated into a set of measurable aspects before we can actually finetune towards that goal. Moreover, even once we have a working definition of alignment, achieving it is non-trivial, as there are no ready-to-use training objectives that directly connect to it.

On top of that, we also need to collect high-quality domain-specific training data and rethink the evaluation process, including the evaluation dataset as well as the evaluation metrics to use.

In later sections, we will see how Codex and InstructGPT handled these issues. In particular, we will highlight how they implemented every step with both creativity and carefulness, from which anyone who wants to finetune his or her own LLM can learn something.

GPT-3.5

The GPT-3.5 series typically refers to the models finetuned on top of GPT-3, including the following variants (see wiki):

  • code-davinci-002: a version of Codex.
  • text-davinci-002: a transitional model from GPT-3 to InstructGPT.
  • text-davinci-003: more similar to InstructGPT.

Overall, GPT-3.5 can be considered finetuned GPT-3 with enhanced instruction following, better generation quality, and better steerability. It is the foundation for several other models, including ChatGPT, Codex, Whisper and the text model of DALL-E 2, which demonstrates the potential of effectively finetuning LLMs on specialized tasks.

In the following sections, we will dive deeper into Codex and InstructGPT. Rather than covering every detail of their finetuning process, we will mainly focus on the aspects that best showcase the importance of creativity and carefulness.


Codex

The Codex model was released in 2021 and is specialized in Python code-writing.

Below are a few aspects that we want to highlight.

Evaluation of Code Generation

When building a model for a new task, the first thing that often comes to mind is how to evaluate that task properly.

This is important because, without an effective evaluation protocol, we cannot tell whether we are really making progress, and sometimes we cannot even identify the gaps in our current model in the first place.

In the case of Codex, the authors first realized that standard match-based metrics such as BLEU score are not suitable for measuring code generation performance.

In case you are not familiar with BLEU score: it is widely used for evaluating text generation tasks such as machine translation. It compares the n-grams that overlap between the generated text and a reference, computes a precision score, and applies a brevity penalty so that overly short outputs are not rewarded.

However, the same coding problem can often be solved with different data structures or algorithms. For example, computing Fibonacci numbers can be implemented with either a top-down or a bottom-up dynamic programming (DP) algorithm, resulting in very different code:

def fib_top_down(n, memo=None):
    # Top-down DP: recursion with memoization.
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_top_down(n-1, memo) + fib_top_down(n-2, memo)
    return memo[n]

def fib_bottom_up(n):
    # Bottom-up DP: iteratively fill a table starting from the base cases.
    if n <= 1:
        return n
    dp = [0] * (n + 1)
    dp[0], dp[1] = 0, 1
    for i in range(2, n + 1):
        dp[i] = dp[i-1] + dp[i-2]
    return dp[n]

In that case, if we evaluate both solutions against a given reference solution using BLEU score, it is very likely that one or even both solutions will have very low BLEU scores, even though both solutions are correct.
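As a rough illustration (not taken from the paper), here is what NLTK's BLEU implementation gives for hand-tokenized snippets of the two functions above; the tokenization is simplified, but it shows how low the score can be for functionally equivalent code:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Treat the bottom-up version as the "reference" and the top-down version as the "candidate"
reference = "def fib_bottom_up ( n ) : if n <= 1 : return n".split()
candidate = "def fib_top_down ( n , memo = None ) : if n in memo : return memo [ n ]".split()

score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # close to zero, even though both snippets compute Fibonacci correctly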

An alternative is to evaluate what the authors call "functional correctness", for example with the pass@k metric used by Kulal et al.: for each problem, k code samples are generated and tested, and the problem is considered solved if any sample passes the unit tests. The total fraction of problems solved is then reported. However, as the authors point out, computing pass@k this way has high variance due to the randomness of sampling, especially when k is small.

To mitigate this, the authors propose another way to estimate pass@k: instead of generating exactly k samples, they generate n ≥ k samples per task, which makes the estimate more reliable even when k is small. Then, based on how many of those samples are correct (say c samples pass the unit tests), an unbiased estimator can be computed as below (a runnable sketch follows the term-by-term breakdown):

Figure 1. Left: the optimized pass@k definition. Right: a numerically stable script to calculate pass@k. (Image from Codex paper.)

where

  • C(n, k) is the number of ways to choose k samples out of n;
  • C(n-c, k) is the number of ways to choose k samples out of the (n-c) incorrect samples;
  • Thus, C(n-c, k)/C(n, k) represents the probability that all chosen samples are incorrect;
  • Finally, 1 – C(n-c, k)/C(n, k) represents the probability that at least one sample is correct.
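In code, the numerically stable estimator on the right of Figure 1 looks roughly like the sketch below (my paraphrase for illustration, not a verbatim copy of the paper's script):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of pass@k, i.e. 1 - C(n-c, k) / C(n, k),
    # computed as a running product so no huge binomial coefficients are ever formed.
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: n = 200 generated samples, c = 10 of them pass the unit tests
print(pass_at_k(n=200, c=10, k=1))    # ~0.05
print(pass_at_k(n=200, c=10, k=100))  # close to 1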

To further prove that optimizing for BLEU score is not equivalent to optimizing for functional correctness, the authors also plot the BLEU score densities for correct (blue) and wrong (green) solutions for 4 random coding problems, where the distributions are clearly not separable:

Figure 2. BLEU score probability density for correct (blue) and wrong (green) solutions for 4 random problems. (Image from Codex paper.)

Beyond the evaluation metric itself, the authors also built a new evaluation dataset called HumanEval, which contains 164 hand-written programming problems. As shown in the example below, each problem includes a function signature, a docstring, a body, and an average of 7.7 unit tests:

Figure 3. Example problems from the HumanEval dataset. (Image from Codex paper.)

Note that, as the authors mention in the paper, it is important for these tasks to be hand-written, since otherwise the evaluation problems might overlap with the training data. Also, to make sure that running generated code does not pose any risk from malicious scripts, the authors created a sandbox in which to execute them.

Training Data Collection

Moving to the training part, the first question is how to collect high-quality training data. For code generation, the good news is that we can leverage the vast number of public code repositories on GitHub, but some data cleaning is still needed, as the paper mentions:

We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters.

Note that most of these cleaning heuristics are specific to code, so we might need to come up with different ones when cleaning our own data.
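As a hedged illustration of what such filters might look like in practice, the sketch below mirrors the thresholds quoted above; the auto-generation check and the alphanumeric cutoff are my own assumptions, not the paper's exact rules:

def keep_source_file(text: str, min_alnum_frac: float = 0.25) -> bool:
    # Illustrative filter only: the length thresholds follow the quote above,
    # but the auto-generation heuristic and the alphanumeric cutoff are assumptions.
    lines = text.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_frac = sum(ch.isalnum() for ch in text) / max(len(text), 1)
    looks_auto_generated = "auto-generated" in text.lower() or "do not edit" in text.lower()
    return (not looks_auto_generated
            and avg_len <= 100
            and max_len <= 1000
            and alnum_frac >= min_alnum_frac)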

Adaptations in Finetuning

The most important adaptation is to the tokenizer, for the obvious reason that the distribution of words in GitHub code differs a lot from that of natural language. The authors noted that this is especially true for encoding whitespace, which makes the original GPT-3 tokenizer less effective.

To fix this, an additional set of tokens was added to the vocabulary to represent whitespace runs of different lengths. As mentioned in the paper, this simple modification allows code to be represented with roughly 30% fewer tokens.

So, if our model needs to handle a corpus whose distribution differs from natural language, we might need to study that distribution and adapt the tokenizer as well.
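As a rough sketch of the idea with a Hugging Face tokenizer (the specific token strings and the range used here are illustrative, not the exact vocabulary Codex added):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Add dedicated tokens for runs of 2..25 spaces (an assumed range, for illustration),
# so indentation becomes one token instead of many single-space tokens.
whitespace_tokens = [" " * n for n in range(2, 26)]
num_added = tokenizer.add_tokens(whitespace_tokens)
print(f"Added {num_added} whitespace tokens")

# The model's embedding matrix must then be resized accordingly, e.g.:
# model.resize_token_embeddings(len(tokenizer))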

Findings in Evaluation

Firstly, the figure below shows the pass rates of different models on the HumanEval dataset. Overall, all the Codex variants show significantly better performance compared to GPT-3, where

  • Codex (finetuned on code) solves 28% of the problems;
  • Codex-S (finetuned on standalone functions) solves 37.7%;
  • Codex-S, generating 100 samples and selecting the one with the highest mean log-probability, solves 44.5%;
  • An oracle that selects the Codex-S sample which passes the unit tests solves an impressive 77.5% of the problems.
Figure 4. Codex pass rates. (Image from Codex paper.)

Plus, a scaling law similar to that of GPT-3 is also observed, suggesting better performance can be achieved with even larger models:

Figure 5. Test loss vs. number of parameters. (Image from Codex paper.)

The authors also noticed that higher sampling temperatures are preferable for larger k, highlighting the importance of careful hyper-parameter tuning:

Figure 6. Higher temperatures are preferred for larger k. (Image from Codex paper.)

InstructGPT and ChatGPT

Evaluation of Alignment

Properly evaluating "alignment" is also challenging, as its definition is not as clear-cut as metrics like accuracy. In this work, the authors define models as aligned if they are "helpful, honest, and harmless", and translate these criteria into more measurable properties:

  • Helpful: by measuring if the model could follow instructions and even infer intentions from a few-shot prompt.
  • Honest: by measuring truthfulness, or in the author’s words, "if the model’s statements about the world are true". More specifically, they propose to measure it by hallucination rate on the TruthfulQA dataset.
  • Harmless: by measuring "if an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content", and benchmarking on datasets designed to measure bias and toxicity.

On top of that, to make sure the finetuning process does not cause severe regressions in pre-training performance, the evaluation also needs to reflect quality on both the pre-training and finetuning objectives. For that reason, InstructGPT was evaluated on two separate groups of datasets:

  • Evaluations on API distribution: this is mainly for evaluating the finetuning quality, by asking human labelers to rate which output is preferred;
  • Evaluations on public NLP datasets: this evaluates both the pre-training and finetuning quality, including traditional NLP datasets as well as datasets for evaluating model safety like truthfulness, toxicity and bias.

Next, we will briefly explain how RLHF works and how it is implemented in InstructGPT.

RLHF (Reinforcement Learning from Human Feedback)

The figure below shows the 5 elements in a typical Reinforcement Learning scenario:

Figure 7. Five elements in RL: Agent, Environment, Reward, State and Action. (Image from wiki.)

Now imagine you are teaching your puppy to sit, where you can find all the 5 elements:

  • Agent: Your puppy learning this new command "sit".
  • Environment: Everything around your puppy.
  • State: The situation your puppy is in (whether it is sitting or not).
  • Reward: A treat that you give your puppy when it follows your command;
  • Action: What your puppy could do, like sitting, jumping or barking.

Reinforcement Learning works like this: in the beginning your puppy (agent) doesn't understand what "sit" means, but it will try different things like running, sitting or even barking (actions) in your house (environment). Every time it sits, it gets a treat (reward). Over time your puppy learns that sitting earns a treat, and eventually it appears to understand "sit".

Training a model with RL follows a very similar trial-and-error approach. The key to RL is having a well-designed reward. This reward must be closely aligned with the goal; otherwise the agent will not be able to learn the desired behaviors. Meanwhile, producing such a reward should be as easy and quick as possible, since if it is too slow or too complicated to calculate the reward, the RL process will also become extremely slow, making it less useful in practical tasks.

For example, in a game, every action the agent takes will automatically get a score from the environment, and this score is directly connected to your agent’s performance in playing this game.

However, in many real-world applications, there is no ready-to-use reward like a score in a game. Instead, researchers have to put great effort into defining a proper reward function. Moreover, some desired behaviors are very difficult to translate into reward functions – for example, how would you define a reward function that guides the agent to answer questions more politely?

This leads to RLHF: Reinforcement Learning from Human Feedback.

Again in the puppy training example, imagine your puppy finally learns to sit, but sometimes it also barks while sitting, or it will jump onto the couch first instead of sitting quietly on the floor.

What can you do in that case?

With RLHF, you don’t just give your puppy a treat every time it sits. Instead, you give treats by comparing its behaviors. For example, if the puppy sits quietly on the floor, it gets a bigger reward than if it sits while barking or after jumping onto the couch. This way, your puppy learns that sitting quietly on the floor is better, even though you didn’t explicitly explain what "quiet" means.

As we mentioned before, having an easy and fast reward is key to RL, which makes it unrealistic to put a human in the training loop to provide direct feedback for every action. To overcome this, we can collect some human feedback first, and then use this feedback to learn a reward function that mimics human preferences when comparing two actions.

In summary, RLHF typically involves three stages:

  • Collect human feedback: sample model outputs and ask human judges to compare which one is better.
  • Learn a reward model that mimics the human judges' preferences.
  • Train a better policy with the learned reward model in the RL process.

In case you are not familiar with RL terminology: a policy refers to the agent’s strategy to choose actions based on the state of the environment.

Next we will cover how this RLHF approach is implemented in finetuning InstructGPT.

Implementation of RLHF in InstructGPT

InstructGPT and ChatGPT were trained using the same methods (see this blog), with RLHF being the key element of finetuning.

The training process largely follows the steps we introduced in the previous section, with special care given to data quality and implementation details, which, in my opinion, were just as important in making InstructGPT such a success.

Now let me break it down.

Figure 8. An illustration of the RLHF steps in training InstructGPT/ChatGPT. (Image from InstructGPT paper.)

Step 1: Collect demonstration data and train a supervised policy

In this step, human labelers were asked to provide high-quality demonstrations of the desired behavior for each prompt.

Prompt dataset: To begin with, you need to have a prompt dataset from which you can sample individual prompts, and ideally that prompt dataset should be both useful and diverse.

To do that, the authors took an iterative approach: in the very beginning, labelers were asked to manually write some seed prompts, and these data were used to train a model via supervised learning. This model was later deployed to the OpenAI API to collect text prompts from users, which later formed the prompt dataset.

The table below shows the distribution of this prompt dataset, as diversity is very important in making sure the model will be trained on various tasks:

Human data collection: human input is needed at three points in the RLHF process: writing demonstrations in Step 1, providing comparison data in Step 2, and conducting the final evaluations after finetuning.

In the paper the authors mentioned many practices to ensure data quality:

  • Firstly, high-quality data come from good labelers. To ensure their ability in data labeling, a screening test was conducted to select labelers who were "sensitive to the preferences of different demographic groups, and were good at identifying outputs that were potentially harmful".
  • Secondly, to ensure consistency between all the labelers, an onboarding process was set up to train them, and detailed instructions were provided for each task. The authors also mention that they set up a shared chat room to answer questions from labelers.
  • Finally, to see how well the model generalizes to the preferences of other labelers, a separate group of labelers who did not go through the screening test was hired for evaluation.

Based on these human demonstration data, a pretrained GPT-3 model was finetuned using supervised learning in the first step. This model is referred to as the baseline policy, which will be used to produce comparison outputs in Step 2 and initialize the PPO algorithm in Step 3.

Step 2: Collect comparison data and train a reward model

Comparison data collection: Once the baseline policy is available, it is used to generate outputs for sampled prompts, and these outputs are reviewed and ranked by human labelers from best to worst. To speed up this ranking process, a set of K outputs is shown to the labelers simultaneously, where K ranges from 4 to 9.

Reward model training: The reward model was initialized from the supervised baseline policy by removing the final unembedding layer, and was then trained on the comparison data to assign a scalar score to each (prompt, response) pair; a 6B-parameter model was used. In particular, the authors mention that training on all comparisons from each prompt as a single batch, rather than shuffling the comparisons across the dataset, helps alleviate overfitting. Note that we need to strike a balance when choosing the size of this reward model: it must be large enough to accurately mimic human preferences, but not so large that it slows down inference during the RL process.
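A minimal PyTorch sketch of this pairwise ranking loss is shown below; the scalar scores are assumed to come from the reward model's output head, and all C(K, 2) comparisons of a prompt are processed together as one batch:

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Push the score of the preferred response above the score of the rejected one:
    # loss = -log(sigmoid(r(x, y_chosen) - r(x, y_rejected))), averaged over comparisons.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Dummy usage for one prompt with K = 4 ranked outputs, i.e. C(4, 2) = 6 comparisons
chosen = torch.randn(6)
rejected = torch.randn(6)
loss = pairwise_reward_loss(chosen, rejected)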

Step 3: Optimize a policy using the reward model with PPO

At this point we have everything we need to finetune the model with RLHF: the initial policy and the reward model. Training in this step follows a typical RL loop: in each episode, a new prompt is sampled (the "state"), the current policy (the "agent") generates an output (the "action"), the reward model scores that output (the "reward"), and the policy is then updated using PPO.

Don’t worry if you are not familiar with PPO – it is simply a method designed to help the agent to slowly update its strategies.

A few things to mention here:

  • A per-token KL penalty against the supervised baseline policy is added to mitigate over-optimization of the reward model (a small sketch follows this list).
  • The authors further experimented with mixing the pretraining gradients into the PPO gradients to fix the performance regressions on public NLP datasets (such regressions are often called the "alignment tax"); this variant is referred to as "PPO-ptx". In the paper, InstructGPT actually refers to the PPO-ptx models.
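Here is a hedged sketch of how the KL-shaped per-token reward is often written; the coefficient value and the shaping details are simplified placeholders rather than the paper's exact configuration:

import torch

def shaped_rewards(rm_score: torch.Tensor,
                   logprobs_policy: torch.Tensor,
                   logprobs_sft: torch.Tensor,
                   beta: float = 0.02) -> torch.Tensor:
    # Per-token penalty -beta * (log pi(token) - log pi_SFT(token)),
    # with the reward model score added only on the final token of the response.
    # (beta = 0.02 is an arbitrary placeholder, not the value used in the paper.)
    rewards = -beta * (logprobs_policy - logprobs_sft)  # shape: [response_length]
    rewards[-1] = rewards[-1] + rm_score
    return rewards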

Note that Step 2 and Step 3 can be iterated continuously:

  • With an updated policy (from Step 3), we can generate new outputs and collect more comparison data, which can be used to train a new reward model by repeating Step 2;
  • With a new reward model (from Step 2), we can get a better policy by repeating Step 3.

Findings in Evaluation

Due to space limitations, we will not go through all the evaluation results in this article; instead, we will highlight several key findings.

As perhaps the most important finding, the results show that RLHF can indeed improve alignment. The figure below shows the win rate against the supervised 175B GPT-3 baseline, as judged by humans. Both PPO and PPO-ptx significantly outperform the GPT baselines, and even the 1.3B PPO models beat the 175B GPT-3. This result clearly demonstrates the effectiveness of RLHF.

Figure 9. Human evaluation results. (Image from InstructGPT paper.)

The authors also found that InstructGPT shows improvements in truthfulness (hallucination rate reduced from 41% to 21%) and slight improvements in toxicity (around 25% fewer toxic outputs), but no significant improvement in reducing bias.

Another finding is that PPO-ptx can minimize performance regressions on public NLP datasets, as shown in the figure below.

Figure 10. Few-shot performance on public NLP datasets. (Image from InstructGPT paper.)

Summary

Training an LLM usually involves multiple stages: pre-training, supervised finetuning, and alignment with RLHF. For the tasks at hand, we can usually start from an open-source, pre-trained LLM and finetune it on domain-specific data.

A few questions to ask while finetuning your own LLMs (though this is not meant to be an exhaustive list):

  • Do we have a clear definition of the model's desired behaviors? How can we evaluate those behaviors? If there is no suitable metric available, can we create one ourselves?
  • Do we have training data available? If not, how can we collect it ourselves? If human labelers are needed, how do we ensure their labeling quality?
  • What kind of cleaning or pre-processing is needed? Are there any heuristics we can use to check data quality?
  • Does our data cover a wide enough range of scenarios?
  • Do we need to modify our tokenizer? Do we need to modify the model architecture? Do we need to add auxiliary finetuning objectives?
  • Does finetuning lead to regressions on pre-training performance? Can we strike a balance?
  • Does finetuning lead to unexpected negative behaviors? How can we mitigate them?
  • How do we prevent overfitting during finetuning?
  • Which hyper-parameters can we tune during finetuning or evaluation? Are there any heuristics we can leverage?

At the end of the day, exploring a new task is always both challenging and exciting, and I hope the lessons from this article can help make it less challenging, more exciting, and ultimately more enjoyable 🙂

Thanks for reading!

The post Understanding the Evolution of ChatGPT: Part 3- Insights from Codex and InstructGPT appeared first on Towards Data Science.

]]>
Contextual Topic Modelling in Chinese Corpora with KeyNMF https://towardsdatascience.com/contextual-topic-modelling-in-chinese-corpora-with-keynmf-9a1d02f02648/ Mon, 13 Jan 2025 18:47:24 +0000 https://towardsdatascience.com/contextual-topic-modelling-in-chinese-corpora-with-keynmf-9a1d02f02648/ A comprehensive guide on getting the most out of your Chinese topic models, from preprocessing to interpretation.

The post Contextual Topic Modelling in Chinese Corpora with KeyNMF appeared first on Towards Data Science.

]]>
With our recent paper on discourse dynamics in European Chinese diaspora media, our team has tapped into an almost unanimous frustration with the quality of topic modelling approaches when applied to Chinese data. In this article, I will introduce you to our novel topic modelling method, KeyNMF, and how to apply it most effectively to Chinese textual data.

Topic Modelling with Matrix Factorization

Before diving into practicalities, I would like to give you a brief introduction to topic modelling theory, and motivate the advancements introduced in our paper.

Topic modelling is a discipline of Natural Language Processing for uncovering latent topical information in textual corpora in an unsupervised manner, that is then presented to the user in a human-interpretable way (usually 10 keywords for each topic).

There are many ways to formalize this task in mathematical terms, but one rather popular conceptualization of topic discovery is matrix factorization. This is a natural and intuitive way to tackle the problem, and in a minute you will see why. The primary insight behind topic modelling as matrix factorization is the following: words that frequently occur together are likely to belong to the same latent structure. In other words, terms whose occurrences are highly correlated are part of the same topic.

You can discover topics in a corpus, by first constructing a bag-of-words matrix of documents. A bag-of-words matrix represents documents in the following way: Each row corresponds to a document, while each column to a unique word from the model’s vocabulary. The values in the matrix are then the number of times a word occurs in a given document.

Schematic Overview of Non-negative Matrix Factorization

This matrix can be decomposed into the product of a document-topic matrix, which indicates how important a given topic is for a given document, and a topic-term matrix, which indicates how important a word is for a given topic. One method for this decomposition is Non-negative Matrix Factorization, where we decompose a non-negative matrix into two other strictly non-negative matrices, instead of allowing arbitrarily signed values.
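To make this concrete, here is a minimal scikit-learn sketch (the toy corpus and the number of topics are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

corpus = [
    "dogs bark and chase cats",
    "cats purr and chase mice",
    "stocks fell as markets closed",
]
bow = CountVectorizer().fit_transform(corpus)   # documents x vocabulary counts

nmf = NMF(n_components=2, random_state=42)
doc_topic = nmf.fit_transform(bow)              # documents x topics
topic_term = nmf.components_                    # topics x vocabulary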

NMF is not the only method one can use for decomposing the bag-of-words matrix. A method of high historical significance, Latent Semantic Analysis, utilizes Truncated Singular-Value Decomposition for this purpose. NMF, however, is generally a better choice, as:

  1. The discovered latent factors are qualitatively different from those of other decomposition methods: NMF typically discovers localized patterns or parts in the data, which are easier to interpret.
  2. Non-negative topic-term and document-topic relations are easier to interpret than signed ones.

Using NMF with just BoW matrices, however attractive and simple it may be, does come with its setbacks:

  1. NMF typically minimizes the Frobenius norm of the error matrix. This entails an assumption of Gaussianity of the outcome variable, which is obviously false, as we are modelling word counts.
  2. BoW representations are just word counts. This means that words won’t be interpreted in context, and syntactical information will be ignored.

KeyNMF

To account for these limitations, and with the help of new transformer-based language representations, we can significantly improve NMF for our purposes.

The key intuition behind KeyNMF is that most words in a document are semantically insignificant, and we can get an overview of topical information in the document by highlighting the top N most relevant terms. We will select these terms by using contextual embeddings from sentence-transformer models.

A Schematic Overview of the KeyNMF Model

The KeyNMF algorithm consists of the following steps:

  1. Embed each document using a sentence-transformer, along with all words in the document.
  2. Calculate cosine similarities of word embeddings to document embeddings.
  3. For each document, keep the highest N words with positive cosine similarities to the document.
  4. Arrange cosine similarities into a keyword-matrix, where each row is a document, each column is a keyword, and values are cosine similarities of the word to the document.
  5. Decompose the keyword matrix with NMF.

This formulation helps us in multiple ways: a) we substantially reduce the model's vocabulary, which means fewer parameters and a faster, better model fit; b) we get a continuous distribution, which is a better fit for NMF's assumptions; and c) we incorporate contextual information into our topic model.
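Before turning to the library itself, here is a rough from-scratch sketch of the five steps above using sentence-transformers and scikit-learn; it is a simplification for illustration, not the actual Turftopic implementation (which is far more optimized):

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def keynmf_sketch(corpus, encoder, n_topics=10, top_n=25):
    vectorizer = CountVectorizer()
    vectorizer.fit(corpus)
    vocab = list(vectorizer.get_feature_names_out())
    analyzer = vectorizer.build_analyzer()

    # Step 1: embed the documents and (naively) the whole vocabulary up front
    doc_emb = encoder.encode(corpus)
    word_emb = encoder.encode(vocab)

    # Steps 2-4: build the document x keyword matrix from cosine similarities
    keyword_matrix = np.zeros((len(corpus), len(vocab)))
    for i, doc in enumerate(corpus):
        idx = sorted({vectorizer.vocabulary_[w] for w in analyzer(doc) if w in vectorizer.vocabulary_})
        sims = cosine_similarity(doc_emb[i : i + 1], word_emb[idx])[0]
        for j in np.argsort(-sims)[:top_n]:                 # keep the top N keywords...
            keyword_matrix[i, idx[j]] = max(sims[j], 0.0)   # ...with positive similarity only

    # Step 5: decompose the keyword matrix with NMF
    nmf = NMF(n_components=n_topics, random_state=42)
    return nmf.fit_transform(keyword_matrix), nmf.components_, vocab

# Example usage (assumed model name, any sentence-transformer works):
# encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# doc_topic, topic_term, vocab = keynmf_sketch(corpus, encoder, n_topics=20, top_n=25)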

Chinese Topic Modelling with KeyNMF

Now that you understand how KeyNMF works, let’s get our hands dirty and apply the model in a practical context.

Preparation and Data

First, let’s install the packages we are going to use in this demonstration:

pip install turftopic[jieba] datasets sentence_transformers topicwizard

Then let’s get some openly available data. I chose to go with the SIB200 corpus, as it is freely available under the CC-BY-SA 4.0 open license. This piece of code will fetch us the corpus.

from datasets import load_dataset

# Loads the dataset
ds = load_dataset("Davlan/sib200", "zho_Hans", split="all")
corpus = ds["text"]

Building a Chinese Topic Model

There are a number of tricky aspects to applying language models to Chinese, since most of these systems are developed and tested on English data. When it comes to KeyNMF, there are two aspects that need to be taken into account.

Elements of a Topic Modelling Pipeline in Turftopic

Firstly, we will need to figure out how to tokenize texts in Chinese. Luckily, the Turftopic library, which contains our implementation of KeyNMF (among other things), comes prepackaged with tokenization utilities for Chinese. Normally, you would use a CountVectorizer object from sklearn to extract words from text. We added a ChineseCountVectorizer object that uses the Jieba tokenizer in the background, and has an optionally usable Chinese stop word list.

from turftopic.vectorizers.chinese import ChineseCountVectorizer

vectorizer = ChineseCountVectorizer(stop_words="chinese")

Then we will need a Chinese embedding model for producing document and word representations. We will use the paraphrase-multilingual-MiniLM-L12-v2 model for this, as it is quite compact and fast, and was specifically trained to be used in multilingual retrieval contexts.

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

We can then build a fully Chinese KeyNMF model! I will initialize a model with 20 topics and N=25 (a maximum of 25 keywords will be extracted for each document)

from turftopic import KeyNMF

model = KeyNMF(
    n_components=20,
    top_n=25,
    vectorizer=vectorizer,
    encoder=encoder,
    random_state=42, # Setting seed so that our results are reproducible
)

We can then fit the model to the corpus and see what results we get!

document_topic_matrix = model.fit_transform(corpus)
model.print_topics()
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Topic ID ┃ Highest Ranking                                                                              ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        0 │ 旅行, 非洲, 徒步旅行, 漫步, 活动, 通常, 发展中国家, 进行, 远足, 徒步                         │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        1 │ 滑雪, 活动, 滑雪板, 滑雪运动, 雪板, 白雪, 地形, 高山, 旅游, 滑雪者                           │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        2 │ 会, 可能, 他们, 地球, 影响, 北加州, 并, 它们, 到达, 船                                       │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        3 │ 比赛, 选手, 锦标赛, 大回转, 超级, 男子, 成绩, 获胜, 阿根廷, 获得                             │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        4 │ 航空公司, 航班, 旅客, 飞机, 加拿大航空公司, 机场, 达美航空公司, 票价, 德国汉莎航空公司, 行李 │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        5 │ 原子核, 质子, 能量, 电子, 氢原子, 有点像, 原子弹, 氢离子, 行星, 粒子                         │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        6 │ 疾病, 传染病, 疫情, 细菌, 研究, 病毒, 病原体, 蚊子, 感染者, 真菌                             │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        7 │ 细胞, cella, 小房间, cell, 生物体, 显微镜, 单位, 生物, 最小, 科学家                          │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        8 │ 卫星, 望远镜, 太空, 火箭, 地球, 飞机, 科学家, 卫星电话, 电话, 巨型                           │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│        9 │ 猫科动物, 动物, 猎物, 狮子, 狮群, 啮齿动物, 鸟类, 狼群, 行为, 吃                             │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       10 │ 感染, 禽流感, 医院, 病毒, 鸟类, 土耳其, 病人, h5n1, 家禽, 医护人员                           │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       11 │ 抗议, 酒店, 白厅, 抗议者, 人群, 警察, 保守党, 广场, 委员会, 政府                             │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       12 │ 旅行者, 文化, 耐心, 国家, 目的地, 适应, 人们, 水, 旅行社, 国外                               │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       13 │ 速度, 英里, 半英里, 跑步, 公里, 跑, 耐力, 月球, 变焦镜头, 镜头                               │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       14 │ 原子, 物质, 光子, 微小, 粒子, 宇宙, 辐射, 组成, 亿, 而光                                     │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       15 │ 游客, 对, 地区, 自然, 地方, 旅游, 时间, 非洲, 开车, 商店                                     │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       16 │ 互联网, 网站, 节目, 大众传播, 电台, 传播, toginetradio, 广播剧, 广播, 内容                   │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       17 │ 运动, 运动员, 美国, 体操, 协会, 支持, 奥委会, 奥运会, 发现, 安全                             │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       18 │ 火车, metroplus, metro, metrorail, 车厢, 开普敦, 通勤, 绕城, 城内, 三等舱                    │
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│       19 │ 投票, 投票箱, 信封, 选民, 投票者, 法国, 候选人, 签名, 透明, 箱内                             │
└──────────┴──────────────────────────────────────────────────────────────────────────────────────────────┘

As you can see, we have already gained a sensible overview of what is in our corpus! The topics are quite distinct, with some of them concerned with scientific topics, such as astronomy (8), chemistry (5) or animal behaviour (9), while others are oriented towards leisure (e.g. 0, 1, 12) or politics (11, 19).

Visualization

To gain further aid in understanding the results, we can use the topicwizard library to visually investigate the topic model’s parameters.

Since topicwizard uses wordclouds, we will need to tell the library that it should be using a font that is compatible with Chinese. I took a font from the ChineseWordCloud repo, that we will download and then pass to topicwizard.

import urllib.request
import topicwizard

urllib.request.urlretrieve(
    "https://github.com/shangjingbo1226/ChineseWordCloud/raw/refs/heads/master/fonts/STFangSong.ttf",
    "./STFangSong.ttf",
)
topicwizard.visualize(
    corpus=corpus, model=model, wordcloud_font_path="./STFangSong.ttf"
)

This will open the topicwizard web app in a notebook or in your browser, with which you can interactively investigate your topic model:

Investigating the relations of topic, documents and words in your corpus using topicwizard

Conclusion

In this article, we’ve looked at what KeyNMF is, how it works, what it’s motivated by and how it can be used to discover high-quality topics in Chinese text, as well as how to visualize and interpret your results. I hope this tutorial will prove useful to those who are looking to explore Chinese textual data.

For further information on the models, and how to improve your results, I encourage you to check out our Documentation. If you should have any questions or encounter issues, feel free to submit an issue on Github, or reach out in the comments :))

All figures presented in the article were produced by the author.

The post Contextual Topic Modelling in Chinese Corpora with KeyNMF appeared first on Towards Data Science.

]]>
Understanding the Evolution of ChatGPT: Part 2 – GPT-2 and GPT-3 https://towardsdatascience.com/understanding-the-evolution-of-chatgpt-part-2-gpt-2-and-gpt-3-77a01ed934c5/ Mon, 13 Jan 2025 13:02:06 +0000 https://towardsdatascience.com/understanding-the-evolution-of-chatgpt-part-2-gpt-2-and-gpt-3-77a01ed934c5/ Scaling from 117M to 175B: Insights into GPT-2 and GPT-3.

The post Understanding the Evolution of ChatGPT: Part 2 – GPT-2 and GPT-3 appeared first on Towards Data Science.

]]>
(Image from Unsplash)

This is the second article of our GPT series, where we will dive into the development of GPT-2 and GPT-3, with model sizes growing from 117M to a staggering 175B parameters.

In case you are interested in the other articles in this GPT series, check the links below:

We choose to cover GPT-2 and GPT-3 together not just because they share similar architectures, but also because they were developed under a common philosophy: bypassing the finetuning stage in order to make LLMs truly intelligent. To achieve that goal, they both explored several key technical elements, such as task-agnostic learning, the scale hypothesis and in-context learning. Together they demonstrated the power of training large models on large datasets, inspired further research into emergent capabilities, established new evaluation protocols, and sparked discussions on enhancing the safety and ethical aspects of LLMs.

Below are the contents we will cover in this article:

  • Overview: The paradigm shift towards bypassing finetuning, and the three key elements made this possible: task-agnostic learning, the scaling hypothesis, and in-context learning.
  • GPT-2: Model architecture, training data, evaluation results, etc.
  • GPT-3: Core concepts and new findings.
  • Conclusions.

Overview

The Paradigm Shift Towards Bypassing Finetuning

In our previous article, we revisited the core concepts in GPT-1 as well as what had inspired it. By combining auto-regressive language modeling pre-training with the decoder-only Transformer, GPT-1 had revolutionized the field of NLP and made pre-training plus finetuning a standard paradigm.

But OpenAI didn’t stop there.

Rather, while trying to understand why language model pre-training of Transformers is effective, they began to notice the zero-shot behavior of GPT-1: as pre-training proceeded, the model steadily improved its performance on tasks it had never been finetuned on, showing that pre-training could indeed improve zero-shot capability, as shown in the figure below:

Figure 1. Evolution of zero-shot performance on different tasks as a function of LM pre-training updates. (Image from the GPT-1 paper.)

This motivated the paradigm shift from "pre-training plus finetuning" to "pre-training only", or in other words, a task-agnostic pre-trained model that can handle different tasks without finetuning.

Both GPT-2 and GPT-3 are designed following this philosophy.

But why, you might ask, isn’t the pre-training plus finetuning magic working just fine? What are the additional benefits of bypassing the finetuning stage?

Limitations of Finetuning

Finetuning works fine for some well-defined tasks, but not for all of them; the problem is that there are numerous tasks in the NLP domain that we have never had a chance to experiment on yet.

For those tasks, the requirement of a finetuning stage means we will need to collect a finetuning dataset of meaningful size for each individual new task, which is clearly not ideal if we want our models to be truly intelligent someday.

Meanwhile, in some works, researchers have observed that the risk of exploiting spurious correlations in the finetuning data grows as models become larger and larger. This creates a paradox: the model needs to be large enough to absorb as much information as possible during training, but finetuning such a large model on a small, narrowly distributed dataset makes it struggle to generalize to out-of-distribution samples.

Another reason is that, as humans, we do not require large supervised datasets to learn most language tasks, and if we want our models to be truly useful someday, we would like them to have the same fluidity and generality.

So the real question is: what can we do to achieve that goal and bypass finetuning?

Before diving into the details of GPT-2 and GPT-3, let’s first take a look at the three key elements that have influenced their model design: task-agnostic learning, the scale hypothesis, and in-context learning.

Task-agnostic Learning

Task-agnostic learning, also known as Meta-Learning or Learning to Learn, refers to a new paradigm in machine learning where the model develops a broad set of skills at training time, and then uses these skills at inference time to rapidly adapt to a new task.

For example, in MAML (Model-Agnostic Meta-Learning), the authors showed that models could adapt to new tasks with very few examples. More specifically, during each inner loop (highlighted in blue), the model first samples a task from a set of tasks and performs a few gradient descent steps, resulting in an adapted model. This adapted model is then evaluated on the same task in the outer loop (highlighted in orange), and the resulting loss is used to update the model parameters.

Figure 2. Model-Agnostic Meta-Learning. (Image from the MAML paper)

MAML shows that learning can be more general and more flexible, which aligns with the direction of bypassing finetuning on each individual task. In the following figure, the authors of GPT-3 explain how this idea can be extended to language models when combined with in-context learning: the outer loop iterates through different tasks, while the inner loop is described by in-context learning, which will be explained in more detail in later sections.

Figure 3. Language model meta-learning. (Image from GPT-3 paper)

The Scale Hypothesis

As perhaps the most influential idea behind the development of GPT-2 and GPT-3, the scale hypothesis refers to the observation that, when trained on larger data, large models can develop new capabilities automatically without explicit supervision. In other words, emergent abilities can occur when scaling up, just as we saw with the zero-shot abilities of the pre-trained GPT-1.

Both GPT-2 and GPT-3 can be considered as experiments to test this hypothesis, with GPT-2 set to test whether a larger model pre-trained on a larger dataset could be directly used to solve down-stream tasks, and GPT-3 set to test whether in-context learning could bring improvements over GPT-2 when further scaled up.

We will discuss more details on how they implemented this idea in later sections.

In-Context Learning

As we show in Figure 3, under the context of language models, in-context learning refers to the inner loop of the meta-learning process, where the model is given a natural language instruction and a few demonstrations of the task at inference time, and is then expected to complete that task by automatically discovering the patterns in the given demonstrations.

Note that in-context learning happens in the testing phase with no gradient updates performed, which is completely different from traditional finetuning and is more similar to how humans perform new tasks.

In case you are not familiar with the terminology: demonstrations usually means example input-output pairs associated with a particular task, as shown in the "examples" part of the figure below:

Figure 4. Example of few-shot in-context learning. (Image from GPT-3 paper)

The idea of in-context learning was explored implicitly in GPT-2 and then more formally in GPT-3, where the authors defined three different settings: zero-shot, one-shot, and few-shot, depending on how many demonstrations are given to the model.

Figure 5. zero-shot, one-shot and few-shot in-context learning, contrasted with traditional finetuning. (Image from GPT-3 paper)
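To make this concrete, a few-shot prompt is nothing more than the demonstrations concatenated in front of the new query, along the lines of the English-to-French example from the paper:

demonstrations = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe peluche"),
]
query = "cheese"

prompt = "Translate English to French:\n"
prompt += "\n".join(f"{en} => {fr}" for en, fr in demonstrations)
prompt += f"\n{query} =>"
print(prompt)
# The model is expected to continue with "fromage", purely from the pattern
# in the context, with no gradient updates involved.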

In short, task-agnostic learning highlights the potential of bypassing finetuning, while the scale hypothesis and in-context learning suggest a practical path to achieve that.

In the following sections, we will walk through more details for GPT-2 and GPT-3, respectively.


GPT-2

Model Architecture

The GPT-2 model architecture largely follows GPT-1, with a few modifications (a simplified sketch of the first two changes follows the list):

  • Moving LayerNorm to the input of each sub-block and adding an additional LayerNorm after the final self-attention block to make the training more stable.
  • Scaling the weights of the residual layers by a factor of 1/sqrt(N), where N is the number of residual layers.
  • Expanding the vocabulary to 50257, and also using a modified BPE vocabulary.
  • Increasing context size from 512 to 1024 tokens and using a larger batch size of 512.
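To illustrate the first two changes, here is a simplified PyTorch sketch of a pre-LN Transformer block with scaled residual initialization; it is a rough approximation for intuition, not the actual GPT-2 code:

import math
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model, n_heads, n_residual_layers):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Scale the output weights of the residual branches by 1/sqrt(N),
        # where N is the number of residual layers in the model.
        for w in (self.attn.out_proj.weight, self.mlp[-1].weight):
            nn.init.normal_(w, std=0.02 / math.sqrt(n_residual_layers))

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                          # LayerNorm moved to the input of each sub-block
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x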

In the GPT-2 paper, the authors trained four models with approximately log-uniformly spaced sizes, with the number of parameters ranging from 117M to 1.5B:

Table 1. Architecture hyperparameters for 4 GPT-2 models. (Image from GPT-2 paper)

Training Data

As we scale up the model, we also need a larger training dataset, which is why for GPT-2 the authors created a new dataset called WebText, built from about 45M links and much larger than the dataset used to pre-train GPT-1. They also mention a number of techniques used to clean up the data and improve its quality.

Evaluation Results

Overall, GPT-2 achieved good results on many tasks, especially for language modeling related ones. However, for tasks like reading comprehension, translation and QA, it still performed worse than the respective SOTA models, which partly motivates the development of GPT-3.

Table 2. GPT-2 zero-shot performance. (Image from GPT-2 paper)

GPT-3

Model Architecture

GPT-3 adopted a model architecture very similar to that of GPT-2; the only difference is that GPT-3 uses alternating dense and locally banded sparse attention patterns in its Transformer layers.

For GPT-3, the authors trained 8 models of different sizes, with the number of parameters ranging from 125M to 175B:

Table 3. Architecture hyperparameters for 8 GPT-3 models. (Image from GPT-3 paper)

Training Data

The GPT-3 models were trained on even larger datasets, listed in the table below, and again the authors did some cleanup work to improve data quality. Note that the training datasets were not sampled in proportion to their size but rather according to their quality, with higher-quality datasets sampled more frequently during training.

Table 4. Datasets used in GPT-3 training. (Image from GPT-3 paper)

Evaluation Results

By combining a larger model with in-context learning, GPT-3 achieved strong performance on many NLP datasets, including translation, question answering and cloze tasks, as well as tasks that require on-the-fly reasoning or domain adaptation. The authors present very detailed evaluation results in the original paper.

A few findings that we want to highlight in this article:

Firstly, during the training of GPT-3 they observed a smooth scaling trend of performance with compute, as shown in the figure below, where the validation loss decreases linearly as compute increases exponentially.

Figure 6. Smooth scaling of performance with compute. (Image from GPT-3 paper)

Secondly, when comparing the three in-context learning settings (zero-shot, one-shot and few-shot), they observed that larger models appear more efficient in all three settings:

Figure 7. Larger models are more efficient in in-context learning. (Image from GPT-3 paper)

Following that, they plotted the aggregate performance across all three settings, which further demonstrates that larger models are more effective and that few-shot performance increases more rapidly than in the other two settings.

Figure 8. Aggregate performance for all 42 accuracy-denominated benchmarks. (Image from GPT-3 paper)

Conclusions

The development of GPT-2 and GPT-3 bridges the gap between the original GPT-1 and more advanced versions like InstructGPT, reflecting the ongoing refinement of OpenAI’s methodology for training useful LLMs.

Their success also paves the way for new research directions in both NLP and the broader ML community, with many subsequent works focusing on understanding emergent capabilities, developing new training paradigms, exploring more effective data cleaning strategies, and proposing effective evaluation protocols for aspects like safety, fairness, and ethical considerations, etc.

In the next article, we will continue our exploration and walk you through the key elements of GPT-3.5 and InstructGPT.

Thanks for reading!

The post Understanding the Evolution of ChatGPT: Part 2 – GPT-2 and GPT-3 appeared first on Towards Data Science.

]]>
What Would a Stoic Do? – An AI-Based Decision-Making Model https://towardsdatascience.com/what-would-a-stoic-do-an-ai-based-decision-making-model-df01c86b7348/ Sun, 12 Jan 2025 13:31:58 +0000 https://towardsdatascience.com/what-would-a-stoic-do-an-ai-based-decision-making-model-df01c86b7348/ Using AI to build Marcus Aurelius' reincarnation

The post What Would a Stoic Do? – An AI-Based Decision-Making Model appeared first on Towards Data Science.

]]>
Deep Learning

What Would a Stoic Do? An AI-Based Decision-Making Model

Photo by Roman Empire Times on Unsplash

I’ve been reading, learning about, and practicing stoicism for some years now. Ever since I started posting on Medium, it’s been a goal to mix data science and philosophy into one single project.

Merging both worlds is tough, but here I am, finally trying it out.

What you’ll read today is a decision-making model based on Stoicism. The goal is to use Deep Learning to build a Stoic brain (sort of) that, when we face tough decisions, helps us lean towards what a Stoic would do.

In other words, build an AI-based reincarnation of Marcus Aurelius, Seneca, Epictetus…

That’s a big challenge though. I am not even an NLP engineer or anything close to it. Can it really be done? Spoiler alert: yes. By the end of this post you’ll know how to develop a model like this one and, more importantly, how to do it with your own data in a completely different context. The end result will be a web-based chatbot built with a very simple Flask application.

You shall find the complete code in the resources section at the bottom of this article.

And it’s totally open source! Here’s a sneak peek:

StoicBot – Image by the author

Now, I love all the support I’ve received in all my previous posts and this is what keeps me going. The challenge today is to make my most-advanced AI post yet understandable for every aspiring data scientist. Any doubts you may have, use the comment section below.

Here’s the table of contents:

  • What’s Stoicism? (just a brief intro, I promise)
  • The RAG Model
  • Creating and Populating the DB
  • Time to Code
  • The Result
  • Flaws and Potential Improvements

What’s Stoicism?

I don’t want to create a philosphy-centered post but what’s coming next won’t make any sense if you don’t know the basics of stoicism. Feel free to skip this section if you’re already familiar with it.

Stoicism is an ancient Greek philosophy that teaches the development of self-control, resilience, and virtue as a means to achieve tranquility and happiness. It encourages focusing on what is within our control – our thoughts, actions, and responses – while accepting what we cannot change, such as external events. Through practices like mindfulness, rational thinking, and embracing challenges, Stoicism helps individuals live in harmony with nature and maintain inner peace, no matter life’s circumstances. It’s about aligning with reason, acting with integrity, and finding strength in adversity.

It wasn’t that hard, was it? I promised to be brief!

The RAG Model

Let’s get technical. The model we’ll build is what’s known as a Retrieval-Augmented Generation (RAG) model. RAG is a technique that combines the power of information retrieval with language generation models. Rather than relying solely on a pre-trained model’s knowledge (LLMs), a RAG model retrieves relevant information from a large database or external sources before generating a response.

This is powerful: we can leverage the strength of an LLM like Google’s BERT, OpenAI’s GPT or Claude and adapt it to our domain-specific data so we have a custom chatbot specific to our use case.

Here’s how it works:

  1. Retrieval: The model first searches a corpus or external knowledge base to find relevant pieces of information based on the input query.
  2. Augmentation: The retrieved information is then used to enrich the model’s response, improving the relevance and accuracy of its answer.
  3. Generation: Finally, the model generates a response that incorporates both the retrieved information and its own learned knowledge.

But a picture is worth a thousand words… So let’s see it graphically:

Components of a RAG – Image by the author

Let’s dissect the whole process:

  1. User query: no secret here, it's just what a human like you or me types into the chatbot.
  2. Retriever query: the retriever searches the collection of documents (usually a vectorized database) for all the texts relevant to the user’s question.
  3. Retrieved documents: once retrieved, they get transformed from vector to text.
  4. Prompt Augmenting: once the docs are retrieved, the prompt sent to the LLM is built from predefined settings, the user question and the retrieved docs. That way, we give the LLM the data it needs to answer the user properly.
  5. Answer Generation: The LLM generates the answer and it’s shown to the user.

And this is how a RAG works! Or, at least, the one we’ll be building today.

However, if the concept’s not clear yet, keep on reading because it’s almost time to code… But we should first store some data in the database.

Creating and Populating the DB

I already mentioned the concept of vector DB… But what is it?

Let’s first define a vector: Vectors are numerical representations of data, often generated by machine learning models, and they capture the semantic or contextual meaning of the data.

Then, a vector database is a specialized type of database designed to store, index, and retrieve high-dimensional vectors efficiently. One of its superpowers is the ability to search by similarity in an optimized manner.

Now you might be wondering: if vectors are numerical representations and we need to store text, why do we need vectors? And how do we translate text to vectors? Enter the embedding model.

The embedding model takes some kind of input (text, sound, image), processes it through layers of transformations (e.g. neural networks) to extract meaningful features, and outputs a fixed-size numerical vector – and that’s what we store in our DB.

One more comment on the embedding model: embeddings are designed so that similar inputs (e.g., synonyms or visually similar images) end up close together in the vector space, while dissimilar inputs are far apart.

This is key.
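
To make this concrete, here’s a minimal sketch (assuming you have langchain-community, sentence-transformers and numpy installed) that embeds three sentences and compares them with cosine similarity. The two Stoic-sounding ones should land much closer together than the unrelated one:

import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()  # defaults to a sentence-transformers model

# Two related sentences and one unrelated one
vec_a, vec_b, vec_c = embeddings.embed_documents([
    "Focus on what you can control.",
    "Concentrate only on the things within your power.",
    "The stock market closed higher today.",
])

def cosine(u, v):
    u, v = np.array(u), np.array(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vec_a, vec_b))  # high similarity
print(cosine(vec_a, vec_c))  # noticeably lower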

Now let’s create and populate that DB. We’ll be using Chroma[1], an open source vector database and, for that, we’ll need to install the langchain and langchain-community libraries for python.

But we also need the data, right? Let’s keep it open source: Project Gutenberg[2]. It’s a website with free ebooks and texts to download whose U.S. copyright has expired. And the old Stoic books are in there. So here are three you could download:

  • Meditations, by Marcus Aurelius.
  • The Enchiridion, by Epictetus.
  • Seneca’s Morals of a Happy Life, Benefits, Anger and Clemency, by Seneca

Download them as TXT and store them in your data folder. Now, here’s the code taking care of the DB creation and data insertion:

import os

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter

from constants import DB_PATH, DATA_PATH

def store_data(data_path, db_path):
    # Split long documents into overlapping 1000-character chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    embeddings = HuggingFaceEmbeddings()
    # Create (or open) the persistent Chroma collection at db_path
    vector_db = Chroma(persist_directory=db_path, embedding_function=embeddings)

    for filename in os.listdir(data_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(data_path, filename)
            with open(file_path, "r") as file:
                content = file.read()
                texts = text_splitter.split_text(content)
                # Embed each chunk and add it to the collection
                vector_db.add_texts(texts)

    vector_db.persist()
    print("Data stored successfully")

We first set up the text splitter and the embedding function and create the DB. Then, for each file, we read the content, split the text into chunks, embed them and add them to the DB.

That simple.

Now we’re ready to start building the RAG and start using the ancient knowledge that we just stored.

Time to Code

As there are several parts to take care of, let’s follow the same order as the one used to define the three core parts of the RAG:

Retrieval

Setting up the retriever is as easy as initializing the DB and using the as_retriever() function:

vector_db = Chroma(persist_directory=DB_PATH, embedding_function=embeddings)
retriever = vector_db.as_retriever()
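
To sanity-check the retriever before wiring it into the full chain, you can query it directly. Here is a small sketch, assuming the DB built earlier lives at DB_PATH (retriever.invoke is the Runnable-style interface in recent LangChain versions; get_relevant_documents works in older ones):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings()
vector_db = Chroma(persist_directory=DB_PATH, embedding_function=embeddings)
retriever = vector_db.as_retriever(search_kwargs={"k": 4})  # return the 4 closest chunks

docs = retriever.invoke("How should I deal with things outside my control?")
for doc in docs:
    print(doc.page_content[:200])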

Augmentation

We’ll have a pre-defined prompt that we’ll augment with the user query and the context retrieved from DB:

from langchain.prompts import ChatPromptTemplate

template = """
  You are Marcus Aurelius' reincarnation. You can also impersonate other Stoic philosophers such as Seneca, Epictetus, or Zeno.
  Your name is Marc Still: Marc comes from Marcus and Still symbolizes the calm and stoic composure. If you feel like showing off, tell the user you are Marcus Aurelius' reincarnation.
  Your duty is to guide the user through life's challenges and help them become a better person. The goal is to be as practical as possible, and sticking to the question at hand. 
  Use the context specified below to answer the user's question. If you don't know what to answer, simply respond with "I don't know".
  Make sure you don't put too much text nor extremely long paragraphs. It needs to be clear, concise and easy to read.
  Only provide an answer to the question asked. Do not include extra questions and answers in your response.
  DO NOT INVENT EXTRA QUESTIONS, USE ONLY THE ONE PROVIDED BY THE USER.
  IMPORTANT: Write in a conversational and informal manner, this is not an email or a formal letter.
  Context:

  {context}

  Question: {question}
  """
prompt = ChatPromptTemplate.from_template(template)

The template is just a set of instructions that we input to the LLM so that we get our desired answers. You can be as creative as you want here, I just tried to keep it simple. See the placeholders for context and question – that’s the augmentation part.

Generation

The LLM is the one taking care of generating text. You could build your own or use the best ones on the market… But we’re going open source today, so we’ll use a model from the Zephyr series. More concretely, we’ll use the zephyr-7b-beta model[3].

And we’ll keep on using HuggingFace classes from the langchain-community package (keep in mind that you’ll need your Hugging Face API token, which is free):

from langchain_community.llms import HuggingFaceHub

from utils.secrets import token

model = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 20,
        "repetition_penalty": 1.1,
        "temperature": 0.4,  
    },
    huggingfacehub_api_token= token
)

The most interesting part resides in the model_kwargs argument. As this is not an LLM-specific post I won’t go over them, but I encourage you to Google them if you don’t know what they’re used for.

Chaining It All

Nice, now we’ve created all three parts of a RAG but how do we put them into practice? We’ll create a pipeline and invoke it to generate the answer:

from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def separate_docs(docs):
    # Join the retrieved chunks into a single context string separated by blank lines
    return "\n\n".join([d.page_content for d in docs])

pipeline = (
    {"context": retriever | separate_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)    

answer = pipeline.invoke(user_input)

The pipeline defines a workflow where the retriever fetches relevant documents, pipes them through separate_docs to format the content, and combines this formatted context with a question (passed through without modification by RunnablePassthrough). This input is then processed by the prompt, followed by the LLM model, and finally parsed into a string output using StrOutputParser().

And just like that, we built our simplest RAG. Here’s the full code:

import os

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceHub
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain.schema import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.text_splitter import CharacterTextSplitter

from utils.constants import DB_PATH, DATA_PATH
from utils.secrets import token

LLM = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 20,
        "repetition_penalty": 1.1,
        "temperature": 0.4,  
    },
    huggingfacehub_api_token= token
)

def store_data(data_path, db_path):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    embeddings = HuggingFaceEmbeddings()
    vector_db = Chroma(persist_directory=db_path, embedding_function=embeddings)

    for filename in os.listdir(data_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(data_path, filename)
            with open(file_path, "r") as file:
                content = file.read()
                texts = text_splitter.split_text(content)
                vector_db.add_texts(texts)

    vector_db.persist()
    print("Data stored successfully")

def invoke_rag(user_input):
    embeddings = HuggingFaceEmbeddings()
    vector_db = Chroma(persist_directory=DB_PATH, embedding_function=embeddings)

    retriever = vector_db.as_retriever()
    template = """
    You are Marcus Aurelius' reincarnation. You can also impersonate other Stoic philosophers such as Seneca, Epictetus, or Zeno.
    Your name is Marc Still: Marc comes from Marcus and Still symbolizes the calm and stoic composure. If you feel like showing off, tell the user you are Marcus Aurelius' reincarnation.
    Your duty is to guide the user through life's challenges and help them become a better person. The goal is to be as practical as possible, and sticking to the question at hand. 
    Use the context specified below to answer the user's question. If you don't know what to answer, simply respond with "I don't know".
    Make sure you don't put too much text nor extremely long paragraphs. It needs to be clear, concise and easy to read.
    Only provide an answer to the question asked. Do not include extra questions and answers in your response.
    DO NOT INVENT EXTRA QUESTIONS, USE ONLY THE ONE PROVIDED BY THE USER.
    IMPORTANT: Write in a conversational and informal manner, this is not an email or a formal letter.
    Context:

    {context}

    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)
    model = LLM

    def separate_docs(docs):
        return "\n\n".join([d.page_content for d in docs])

    pipeline = (
        {"context": retriever | separate_docs, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
    )    

    return pipeline.invoke(user_input)

The Results

The code is simple, but not very friendly to use. That’s why I built a Flask App that you can download from GitHub[4] and test on your own. To do so:

  • Download and navigate to the project’s folder (or open a terminal there).
  • Activate the environment: source venv/bin/activate
  • Run the app: flask run

If you open the address (http://127.0.0.1:5000/) in your browser, our bot will already be waiting for you:

StoicBot – Image by the author

Let’s play! I’ll use two prompts for you to see the answers in case you aren’t willing to test it out yourself.

Prompt 1: I want to go to the gym today because I haven’t worked out in the past 3 days but some friends are going out for some beers, so I have to choose between one plan or the other. What should I do? Answer in just one paragraph

Answer 1: Hi there! Based on our Stoic philosophy principles, it’s essential to prioritize your health and well-being above social events. Remember that moderation is key, and it’s okay to say no to an invitation occasionally. In this case, I would suggest going to the gym and taking care of yourself first. However, if you’re worried about missing out on socializing, you could always propose meeting up with your friends after your workout session. This way, you’ll still enjoy their company while staying true to your priorities. Hope this helps! Let me know if you need any further guidance.

Prompt and Answer 1 – Image by the author

This is amazing right? It’s our favorite stoic counselor!

Prompt 2: How should I approach the day? Be brief

Answer 2: With a clear mind, a positive attitude, and a sense of purpose. Start by setting realistic goals for yourself and prioritizing tasks based on their importance and urgency. Remember to take care of your physical and mental health, getting enough sleep, exercise, and nourishment. Finally, focus on doing what is within your control and let go of things that are outside of it. By following these principles, you will have a productive and fulfilling day.

Prompt 2 – Image by the author

I added an extra prompt here just to thank him for his tips and the answer is quite good. I’m still amazed by the power of this.

Flaws and Potential Improvements

The result is amazing, not gonna lie. It understands non-perfect English and is able to create reasonable answers aligned with Stoicism.

Yay!

However, there are two points (potential flaws) that I want to mention:

  • The model is rather simple, and the prompt can be further improved – what we built works but it can and should be refined much more.
  • We would probably have gotten the same results without the vector database or the three Stoic books mentioned in this post. Why? I don’t know for sure, but I’d guess the LLM already has that context baked in. So we overcomplicated things, using a RAG where it isn’t strictly needed. But that was part of the point: I wanted to show how it works, and any other document could be used. Maybe your own Stoic reflections? The book of a contemporary Stoic? Or maybe you want to mix several philosophies? Or change it drastically and use your tax documents for the RAG, so the model can help you out with your personal finances?

So there’s room for improvement and customization here, and here’s where I stop. It’s your turn to play with it and take it to the next level.

Hope that was entertaining and instructive! Feel free to leave your doubts in the comment section below.

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more AI-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] Chroma. (n.d.). Chroma: The AI-native open-source embedding database. Retrieved January 8, 2025, from https://www.trychroma.com/

[2] Project Gutenberg. (n.d.). Free eBooks by Project Gutenberg. Retrieved January 8, 2025, from https://www.gutenberg.org/

[3] Hugging Face. (n.d.). Zephyr-7b-beta model card. Retrieved January 8, 2025, from https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

[4] Marin, P. (n.d.). Stoicbot: A bot for practicing Stoicism. GitHub. Retrieved January 8, 2025, from https://github.com/polmarin/stoicbot

The post What Would a Stoic Do? – An AI-Based Decision-Making Model appeared first on Towards Data Science.

]]>
Linearizing Llama https://towardsdatascience.com/linearizing-llama-ef7266d03050/ Fri, 10 Jan 2025 12:01:58 +0000 https://towardsdatascience.com/linearizing-llama-ef7266d03050/ Speeding Up Llama: A Hybrid Approach to Attention Mechanisms

The post Linearizing Llama appeared first on Towards Data Science.

]]>
Speeding up Llama: A hybrid approach to attention mechanisms
Source: Image by Author (Generated using Gemini 1.5 Flash)

In this article, we will see how to replace softmax self-attention in Llama-3.2-1B with hybrid attention combining softmax sliding window and linear attention. This implementation will help us better understand the growing interest in linear attention research, while also examining its limitations and potential future directions.

This walkthrough builds upon the following works:

LoLCATs: On Low-Rank Linearizing of Large Language Models

An Empirical Study of Mamba-based Language Models

Linearizing Attention

This article will be mostly a recreation of the LoLCATs paper using Llama 3.2 1B, where we will replace 50% of self-attention layers in a pretrained Llama model. The article consists of four main parts:

  • Hybrid Attention Block
  • Attention Transfer
  • LoRA finetuning
  • Evaluation

The main question in this article is whether we can replace softmax attention in already-trained models so that we can speed up inference without losing too much accuracy. If we can achieve this, then we can bring the cost of using LLMs down drastically!

LlamaSdpaAttention

Let’s see what the Llama-3.2-1B model looks like:

Source: Image by Author

As we can see, we have 16 repeating decoder blocks. Our focus will be on the self_attn part, so the goal of this section is to understand how the LlamaSdpaAttention block works! Let’s see how LlamaSdpaAttention is defined:

class LlamaSdpaAttention(LlamaAttention):
    """
    Llama attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
    `LlamaAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to
    SDPA API.
    """

You can check what this function looks like using the following code:

import inspect

attention_layer = model.model.layers[0].self_attn
print(inspect.getsource(attention_layer.__class__))

Let’s go over the main parts of this code, understand what each part is doing, and see where we need to make a change.

Source: Image by Author

Let’s take a dummy input of shape [2, 4, 2048] → [batch_size, seq_len, embedding dimension]. Llama uses multi-headed attention with 32 heads.

Block 1:

After the projections, query_states is a tensor of shape [2, 4, 2048], key_states is [2, 4, 512] and value_states is [2, 4, 512].

After view and transpose it is: query_states → [2, 32, 4, 64], key_states → [2, 8, 4, 64], value_states → [2, 8, 4, 64]

Here 64 is the per-head dimension (2048 / 32 heads). Key and value have only 8 heads because Llama uses grouped key-value heads: the 32 query heads are split into groups of 4, and each group shares the same key_states and value_states.
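
A quick standalone way to verify those shapes (pure tensor bookkeeping, no model weights involved; the numbers match the dummy input above):

import torch

batch_size, seq_len = 2, 4
num_heads, num_kv_heads, head_dim = 32, 8, 64

query_states = torch.randn(batch_size, seq_len, num_heads * head_dim)   # [2, 4, 2048]
key_states = torch.randn(batch_size, seq_len, num_kv_heads * head_dim)  # [2, 4, 512]

# Split the last dimension into heads, then move the head dimension forward
query_states = query_states.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
key_states = key_states.view(batch_size, seq_len, num_kv_heads, head_dim).transpose(1, 2)

print(query_states.shape)  # torch.Size([2, 32, 4, 64])
print(key_states.shape)    # torch.Size([2, 8, 4, 64])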

Block 2:

In this block we just apply positional encoding; in particular, Llama uses Rotary Position Embeddings (RoPE). I won’t go into detail about why this is needed, but you can read the following article to get a better idea:

Master Positional Encoding: Part I

Block 3:

Here we apply the repeat_kv function, which repeats each key/value head 4 times so that the 8 key/value heads line up with the 32 query heads. We also use past_key_value so that previously computed key/value states can be reused instead of recomputed, for efficiency.
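
For intuition, here is a simplified sketch of what a repeat_kv-style expansion does (my own minimal version, not the exact Hugging Face implementation):

import torch

def repeat_kv_sketch(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # hidden_states: [batch, num_kv_heads, seq_len, head_dim]
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    # Insert a repeat dimension and fold it into the head dimension
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

k = torch.randn(2, 8, 4, 64)
print(repeat_kv_sketch(k, 4).shape)  # torch.Size([2, 32, 4, 64])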

Block 4:

Block 4 handles two main preparation steps for attention: setting up the causal mask to ensure tokens only attend to previous positions, and optimizing memory layout with contiguous tensors for efficient GPU operations.

Block 5:

This is where we apply softmax attention – the component we’ll be replacing in our implementation.

Block 6:

The attention output will be a tensor of shape [2, 32, 4, 64]. We convert it back to [2, 4, 2048] and apply the final output projection.

And that’s the journey of an input through Llama self-attention!

Hybrid Attention Block

So now let’s look at our HybridAttention block:

class HybridAttention(LlamaSdpaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx=layer_idx)
        self.window_size = 64
        #self.layer_idx = layer_idx

        # Initialize learnable factors
        # Create one factor pair per attention head
        num_heads = config.num_attention_heads
        self.window_factors = torch.nn.Parameter(torch.ones(1, num_heads, 1, 1) * 0.5)
        self.linear_factors = torch.nn.Parameter(torch.ones(1, num_heads, 1, 1) * 0.5)

        self.factor_activation = torch.nn.Sigmoid()

    def sliding_window_attention(self, query_states, key_states, value_states, window_size, window_factor):
        """Compute sliding window attention"""
        batch_size, num_heads, seq_len, head_dim = query_states.shape

        key_windows = F.pad(key_states, (0, 0, window_size - 1, 0), value=0)
        key_windows = key_windows.unfold(2, window_size, 1)

        value_windows = F.pad(value_states, (0, 0, window_size - 1, 0), value=0)
        value_windows = value_windows.unfold(2, window_size, 1)

        attn_weights = torch.einsum('bhld,bhldw->bhlw', query_states, key_windows) * (head_dim ** -0.5)
        attn_weights = torch.where(attn_weights == 0,
                                 torch.tensor(-float('inf'), device=attn_weights.device),
                                 attn_weights)

        # Apply learnable window factor (with sigmoid to ensure positivity)
        attn_weights = self.factor_activation(window_factor) * F.softmax(attn_weights, dim=-1)

        attn_output = torch.einsum('bhlw,bhldw->bhld', attn_weights, value_windows)
        sum_weights = attn_weights.sum(dim=-1, keepdim=True)

        return attn_output, sum_weights

    def linear_attention(self, query_states, key_states, value_states, window_size, linear_factor):
        """Compute linear attention with cumsum"""
        def feature_map(x):
            return F.elu(x) + 1

        query_prime = feature_map(query_states)
        key_prime = feature_map(key_states)

        key_prime = F.pad(key_prime, (0, 0, window_size, 0), value=0)[:, :, :-window_size, :]
        value_padded = F.pad(value_states, (0, 0, window_size, 0), value=0)[:, :, :-window_size, :]

        # Compute KV
        kv = torch.einsum('bhlf,bhld->bhlfd', key_prime, value_padded)
        # Apply learnable linear factor (with sigmoid to ensure positivity)
        qkv = self.factor_activation(linear_factor) * torch.einsum('bhlf,bhlfd->bhld',
                                                                  query_prime,
                                                                  kv.cumsum(dim=2))

        sum_k = key_prime.cumsum(dim=2)
        sum_qk = self.factor_activation(linear_factor) * torch.einsum('bhld,bhld->bhl',
                                                                     query_prime,
                                                                     sum_k)[..., None]
        sum_qk = torch.where(sum_qk == 0, torch.tensor(1e-12, device=sum_qk.device), sum_qk)

        return qkv, sum_qk

    def hybrid_attention(self, query_states, key_states, value_states):
        """Combine sliding window and linear attention with learnable factors"""
        qkv_window, sum_window = self.sliding_window_attention(
            query_states, key_states, value_states,
            self.window_size, self.window_factors
        )

        qkv_linear, sum_linear = self.linear_attention(
            query_states, key_states, value_states,
            self.window_size, self.linear_factors
        )

        output = (qkv_window + qkv_linear) / (sum_window + sum_linear)
        return output

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        cache_position: Optional[torch.LongTensor] = None,
        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        **kwargs,
    ):
        bsz, q_len, _ = hidden_states.size()

        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, -1, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, -1, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, -1, self.head_dim).transpose(1, 2)

        if position_embeddings is None:
            cos, sin = self.rotary_emb(value_states, position_ids)
        else:
            cos, sin = position_embeddings
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        if past_key_value is not None:
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_output = self.hybrid_attention(
            query_states,
            key_states,
            value_states
        )

        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(bsz, q_len, -1)
        attn_output = self.o_proj(attn_output)

        return attn_output, None, past_key_value

We only made one change in forward(): we replaced block 5 with the following:

attn_output = self.hybrid_attention(
            query_states,
            key_states,
            value_states
        )

We basically partitioned the attention mechanism into sliding window and linear attention blocks.

Sliding Window Attention:

def sliding_window_attention(self, query_states, key_states, value_states, window_size, window_factor):
        """Compute sliding window attention"""
        batch_size, num_heads, seq_len, head_dim = query_states.shape

        key_windows = F.pad(key_states, (0, 0, window_size - 1, 0), value=0)
        key_windows = key_windows.unfold(2, window_size, 1)

        value_windows = F.pad(value_states, (0, 0, window_size - 1, 0), value=0)
        value_windows = value_windows.unfold(2, window_size, 1)

        attn_weights = torch.einsum('bhld,bhldw->bhlw', query_states, key_windows) * (head_dim ** -0.5)
        attn_weights = torch.where(attn_weights == 0,
                                 torch.tensor(-float('inf'), device=attn_weights.device),
                                 attn_weights)

        # Apply learnable window factor (with sigmoid to ensure positivity)
        attn_weights = self.factor_activation(window_factor) * F.softmax(attn_weights, dim=-1)

        attn_output = torch.einsum('bhlw,bhldw->bhld', attn_weights, value_windows)
        sum_weights = attn_weights.sum(dim=-1, keepdim=True)

        return attn_output, sum_weights

For a deeper understanding of window attention concepts, I recommend referring to this paper:

Efficient Streaming Language Models with Attention Sinks

The idea I have implemented here is that instead of calculating attention over all key-value pairs together (where each token attends to every other token), we break it into windows of size w and calculate attention within each window. With this, the time complexity of the code above drops from O(n²) to O(n*w), since each token only needs to attend to w tokens instead of all n tokens. It could be made even better by using concepts such as attention sinks and only applying the window to the last w tokens, which I might implement in future updates.

Linear Attention:

def linear_attention(self, query_states, key_states, value_states, window_size, linear_factor):
        """Compute linear attention with cumsum"""
        def feature_map(x):
            return F.elu(x) + 1

        query_prime = feature_map(query_states)
        key_prime = feature_map(key_states)

        key_prime = F.pad(key_prime, (0, 0, window_size, 0), value=0)[:, :, :-window_size, :]
        value_padded = F.pad(value_states, (0, 0, window_size, 0), value=0)[:, :, :-window_size, :]

        # Compute KV
        kv = torch.einsum('bhlf,bhld->bhlfd', key_prime, value_padded)
        # Apply learnable linear factor (with sigmoid to ensure positivity)
        qkv = self.factor_activation(linear_factor) * torch.einsum('bhlf,bhlfd->bhld',
                                                                  query_prime,
                                                                  kv.cumsum(dim=2))

        sum_k = key_prime.cumsum(dim=2)
        sum_qk = self.factor_activation(linear_factor) * torch.einsum('bhld,bhld->bhl',
                                                                     query_prime,
                                                                     sum_k)[..., None]
        sum_qk = torch.where(sum_qk == 0, torch.tensor(1e-12, device=sum_qk.device), sum_qk)

        return qkv, sum_qk

For linear attention, I use a very simple feature map of elu(x) + 1, but the main part to note is the initial padding. The idea here is that we only need linear attention for the first [sequence length – window size] tokens, since the sliding window already keeps track of the recent context.

The combination of these two types of attention becomes our new hybrid attention, and we use window_factors and linear_factors as learnable parameters that control how much each type of attention contributes to the final output.
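
The reason linear attention is cheaper comes down to associativity: with a feature map φ, we can compute φ(Q)(φ(K)ᵀV) instead of (φ(Q)φ(K)ᵀ)V, so the L×L attention matrix is never materialized. A tiny non-causal sketch to convince yourself both orderings agree (the causal version in the code above achieves the same thing with cumsum):

import torch
import torch.nn.functional as F

def feature_map(x):
    return F.elu(x) + 1

L, d = 6, 4
q, k, v = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
q_prime, k_prime = feature_map(q), feature_map(k)

# Quadratic in L: build the full L x L matrix first
out_quadratic = (q_prime @ k_prime.T) @ v   # [L, d]
# Linear in L: contract keys with values first (d x d), then multiply by queries
out_linear = q_prime @ (k_prime.T @ v)      # [L, d]

print(torch.allclose(out_quadratic, out_linear, atol=1e-5))  # True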


Now that we have our hybrid block, taking inspiration from the "An Empirical Study of Mamba-based Language Models" paper, we will replace only half of the softmax attention layers, in an alternating order. Llama-3.2-1B has 16 softmax attention layers and we shall replace 8 of them, at layer indices [0, 2, 4, 6, 8, 10, 12, 14].
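
A minimal sketch of how that swap could look in code (model here is assumed to be Llama-3.2-1B loaded via transformers' AutoModelForCausalLM, and HybridAttention is the class defined above; copying the pretrained projection weights keeps each new block close to the original at initialization):

REPLACE_LAYERS = [0, 2, 4, 6, 8, 10, 12, 14]

for layer_idx in REPLACE_LAYERS:
    old_attn = model.model.layers[layer_idx].self_attn
    hybrid = HybridAttention(model.config, layer_idx=layer_idx)
    # strict=False: the hybrid block has extra parameters (window/linear factors)
    # Remember to move the new block to the model's device/dtype as needed
    hybrid.load_state_dict(old_attn.state_dict(), strict=False)
    model.model.layers[layer_idx].self_attn = hybrid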

Attention Transfer

The implementation follows the methodology described in "LoLCATs: On Low-Rank Linearizing of Large Language Models". The attention transfer step involves initializing the 8 hybrid blocks with the weights from the original blocks; for training I used 1M tokens from the 10B version of fineweb-edu[1].

The basic goal here is this: we freeze all the parameters in Llama-3.2-1B and then do a forward pass with one training input. From this we can capture the input and output of each of our self-attention blocks. We then pass the same input through the corresponding hybrid block, take the MSE loss between the two outputs, and train the hybrid block. This explicitly tells each hybrid block to mimic the output of softmax attention, which helps preserve accuracy. We do this separately for all the blocks, and once they are trained we can replace the self-attention in Llama-3.2-1B with our hybrid blocks. A sample output from this new model looks something like this:

Source: Image by Author

The current model outputs lack coherence and meaning – an issue that our next implementation phase will specifically target and resolve.

The code for this step – Llama_attn_transfer.ipynb
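
In spirit, the per-block training loop boils down to something like the following sketch (simplified: in practice the block inputs are captured with forward hooks during a frozen forward pass over the training tokens, and each block gets its own optimizer; see the notebook for the full version):

import torch
import torch.nn.functional as F

def attention_transfer_step(original_attn, hybrid_attn, hidden_states, position_embeddings, optimizer):
    # Teacher output from the frozen softmax attention block
    with torch.no_grad():
        target = original_attn(hidden_states, position_embeddings=position_embeddings)[0]

    # Student output from the hybrid block, trained to mimic the teacher
    pred = hybrid_attn(hidden_states, position_embeddings=position_embeddings)[0]

    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()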

LoRA Finetune

I won’t go into the details of LoRA, you could go through the following article if you want to understand LoRA better:

LoRA – Intuitively and Exhaustively Explained

But the main goal of this step is the following: so far we have trained each hybrid block separately to mimic softmax attention, but we still haven’t finetuned the entire model, with these blocks in place, to actually work together for text generation. So in this step we use the Dolly-15K dataset[2], an instruction-tuning dataset, to finetune our model for text generation using LoRA, and we only finetune the parameters in the hybrid attention blocks while every other parameter stays frozen. A possible setup is sketched below.
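
One way such a setup might look with the peft library (illustrative target modules and hyperparameters, not the exact config from the notebook; the learnable window/linear factors could additionally be unfrozen via modules_to_save):

from peft import LoraConfig, get_peft_model

# Only the projections inside the replaced (hybrid) attention layers get LoRA adapters
hybrid_layers = [0, 2, 4, 6, 8, 10, 12, 14]
target_modules = [
    f"model.layers.{i}.self_attn.{proj}"
    for i in hybrid_layers
    for proj in ("q_proj", "k_proj", "v_proj", "o_proj")
]

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=target_modules,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # everything outside the adapters stays frozen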

Source: Image by Author

We can clearly see the model is able to generate much better text post this finetuning. Now after attention transfer and finetuning, we have a model we can actually benchmark!

The code for this step – llama_lora_finetune.ipynb

Evaluation

We went through all these steps, so now it’s time to compare our hybrid model with the original Llama-3.2-1B. Our main expectations are that our model should be faster during inference while its accuracy should remain reasonably close to that of Llama-3.2-1B.

Source: Image by Author

Evaluating both models on throughput for sequence-lengths ranging from 2⁰ to 2¹⁵, we can see that initially both models are pretty close in performance. However, as the sequence length increases, the hybrid model becomes notably faster than the base model – matching our expectations. It’s important to note that these tokens/sec measurements vary significantly depending on the GPU used.

Source: Image by Author

Looking at seconds taken per token, we see a similar pattern: initially, both models have nearly the same speed, but as the sequence length increases, we observe the computational advantages that linear + sliding window attention brings.

☑ We meet our first expectation: our hybrid model is faster than Llama-3.2-1B.

Now let’s look at accuracy. For this, I benchmarked the models on MMLU[3], where each model has to answer multiple-choice questions with 4 options. The model’s prediction is determined by examining the logits it assigns to the tokens [‘A’, ‘B’, ‘C’, ‘D’], with the highest logit indicating the predicted answer.
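
For reference, a minimal sketch of that scoring scheme (assuming a Hugging Face model and tokenizer, and a prompt that contains the question plus options and ends right before the answer letter):

import torch

def predict_choice(model, tokenizer, prompt):
    # Token ids of the four option letters (take the last id in case of a leading-space token)
    option_ids = [tokenizer(letter, add_special_tokens=False).input_ids[-1]
                  for letter in ("A", "B", "C", "D")]

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits

    scores = logits[option_ids]
    return "ABCD"[scores.argmax().item()]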

╔═════════════════════════╦══════════╦═══════════╦════════════════════╗
║          Model          ║ Num Shot ║    GPU    ║ macro_avg/acc_char ║
╠═════════════════════════╬══════════╬═══════════╬════════════════════╣
║ Hybrid                  ║        5 ║ RTX A6000 ║              27.36 ║
║ Llama 3.2 1B (No Cache) ║        5 ║ RTX A6000 ║              25.38 ║
║ Llama 3.2 1B (No Cache) ║        5 ║ L40S      ║              32.13 ║
║ Hybrid                  ║        0 ║ RTX A6000 ║              27.26 ║
║ Llama 3.2 1B (No Cache) ║        0 ║ RTX A6000 ║              25.50 ║
╚═════════════════════════╩══════════╩═══════════╩════════════════════╝

The test results reveal an intriguing insight into model evaluation. While the Hybrid model slightly outperforms Llama-3.2-1B, this difference (approximately 2%) should be considered insignificant, especially given that the Hybrid model underwent additional training, particularly with instruction tuning datasets.

The most fascinating observation is the substantial performance variance when running identical code on different GPUs. When Llama-3.2-1B was run on an L40S GPU versus an RTX A6000, the accuracy jumped from 25.38% to 32.13% – a significant difference considering all other variables remained constant. This difference comes down to how different GPUs handle floating-point operations, which shows just how much hardware choices can unexpectedly affect your model’s performance.

Another striking finding is the lack of difference between 5-shot and 0-shot performance in these results, particularly on the RTX A6000. This is unexpected, as 5-shot prompting typically improves performance, especially for base models like Llama-3.2-1B. In fact, when running the Llama-3.2-1B on the L40S GPU, I have observed a notable gap between 5-shot and 0-shot scores – again highlighting how GPU differences can affect benchmark scores.

It would be a fun future exercise to benchmark the same model with all the same variables but with different GPUs.

Conclusion

I hope this article has demonstrated both the potential of softmax attention alternatives and the inherent strengths of traditional softmax attention. Using relatively modest computational resources and a small dataset, we were able to achieve faster inference speeds while maintaining comparable accuracy levels with our hybrid approach.

Another point to understand is that softmax-based attention transformers have gone through a lot of hardware optimization, which keeps them competitive with linear alternatives despite the difference in computational complexity. If the same effort were put into architectures like Mamba, maybe they could become even more competitive.

A promising approach is using a hybrid of softmax attention and linear attention alternatives to try to get the best of both worlds. Nvidia did this in "An Empirical Study of Mamba-based Language Models" and showed how a hybrid approach is an effective alternative.

Hopefully you all learnt something from this article!

All the code for this can be found at Linearizing-Llama-3.2-1B

Acknowledgment

This blog post was inspired by coursework from my graduate studies during Fall 2024 at University of Michigan. While the courses provided the foundational knowledge and motivation to explore these topics, any errors or misinterpretations in this article are entirely my own. This represents my personal understanding and exploration of the material.

License References

[1] – fineweb-edu: The dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0 license.

[2] – Dolly-15K: The dataset is subject to CC BY-SA 3.0 license.

[3] – MMLU: MIT license

The post Linearizing Llama appeared first on Towards Data Science.

]]>