Avoiding Costly Mistakes with Uncertainty Quantification for Algorithmic Home Valuations
https://towardsdatascience.com/avoiding-costly-mistakes-with-uncertainty-quantification-for-algorithmic-home-valuations/

The simple tricks for using AVMU, or Automated Valuation Model Uncertainty, to make your home buying decisions more confident and less risky!

When you’re about to buy a home, whether you’re an everyday buyer looking for your dream house or a seasoned property investor, there’s a good chance you’ve encountered automated valuation models, or AVMs. These clever tools use massive datasets filled with past property transactions to predict the value of your potential new home. By considering features like location, number of bedrooms, bathrooms, property age, and more, AVMs use AI to learn associations with sales prices. A rapid and low-cost appraisal of any home sounds great on paper, and in many cases it is great. However, with every price prediction comes a level of uncertainty, and failing to consider this uncertainty can be a costly mistake. In this post, I illustrate the application of AI uncertainty quantification for AVMs through the AVMU methodology.

Price Prediction Uncertainty?

Let’s start off simple. Imagine you’re looking for a two-story, four-bedroom house in a cozy neighborhood in Virginia Beach, VA. You’ve downloaded some local housing data and used it to train your own AVM (you’re tech-savvy like that!).

Case 1: Lucky you, several almost identical homes in the neighborhood have sold for around $500,000 in the past year. Your AVM confidently suggests the home you’re interested in will also likely be worth around the same price. Easy enough, right?

But here’s where it gets trickier:

Case 2: This time, no similar two-story, four-bedroom homes have sold recently. Instead, your dataset shows smaller, one-story homes selling at $400,000, and larger, three-story homes going for $600,000. Your AVM averages things out and again suggests $500,000. It makes sense, your target house is bigger than the cheaper homes and smaller than the pricier ones.

Both scenarios gave you the same $500,000 valuation. However, there’s a catch: The first scenario is backed by solid data (similar homes selling recently), making the price prediction quite reliable. In the second scenario, on the other hand, trusting the price prediction might be a bit riskier. With fewer comparable sales, the AVM had to make “an educated guess”, leading to a less certain price prediction.

The solid AVM in Case 1 is a very helpful decision support tool for purchasing a home, but the shaky AVM in Case 2 can give you a totally wrong idea of the home’s market value. Here’s the big question:

How can you tell whether your AVM prediction is solid or shaky?

AVMU—An Uncertainty Quantification Technique for AVMs

This is exactly why we need AVMU, or Automated Valuation Model Uncertainty. AVMU is a recent methodological framework that helps us quantify exactly how reliable (or uncertain) these AVM predictions are. Think of it as a confidence meter for your house price prediction, helping you make smarter decisions instead of blindly trusting an algorithm.

Let’s return to our Virginia Beach example. You’ve browsed listings extensively and narrowed your choices down to two fantastic homes: let’s call them Home A and Home B.

Image by Author, made partly with DALL-E.

Of course, the first thing you want to know is their market values. Knowing the market value ensures you don’t overpay, potentially saving you from future financial headaches and having to resell the home at a loss. Unfortunately, you don’t have much knowledge about house prices in Virginia Beach, as you’re originally from [insert name of the place you grew up]. Fortunately, you recall the data science skills you picked up in grad school and confidently decide to build your own AVM to get a grasp of the market values of your two candidate homes.

To ensure your AVM predictions are as accurate as possible, you train the model using Mean Squared Error (MSE) as your loss function:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

Here, \( n \) is the number of homes in your training dataset, \( \hat{y}_i \) represents the AVM’s price prediction for home \( i \), and \( y_i \) is the actual price at which home \( i \) was sold.
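To make this loss concrete, here is a minimal sketch of how it is computed; the prices and predictions below are made-up numbers for illustration only:

import numpy as np

# Hypothetical sales prices (y) and AVM predictions (y_hat) for five homes
y = np.array([480_000, 510_000, 495_000, 620_000, 405_000])
y_hat = np.array([500_000, 500_000, 500_000, 590_000, 420_000])

mse = np.mean((y - y_hat) ** 2)  # average squared deviation, in squared dollars
rmse = np.sqrt(mse)              # back on the dollar scale, easier to interpret
print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")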

Image by Author, made partly with DALL-E.

After training the model, you eagerly apply your AVM to Homes A and B. To your surprise (or perhaps excitement?), both homes are valued at exactly $500,000 by the algorithm. Very well, but just as you’re about to place an offer on Home B, a thought strikes: these predictions aren’t absolute certainties. They’re “point predictions”, essentially the AVM’s best guess at the most likely market value. In fact, the true market value is probably somewhat higher or lower, and it’s rather unlikely that the AVM prediction nailed the market value down to the exact dollar.

So, how do we measure this uncertainty? This is where AVMU methodology comes into play, with a straightforward but powerful approach:

  1. First, you use cross-validation (e.g., 5-fold CV) to generate out-of-fold price predictions, \( \hat{y}_i \), for all the \( n \) homes in your dataset.
  2. Next, for each home, you calculate how far off the prediction was from the actual sales price. This difference is called the absolute deviation, \( |\hat{y}_i - y_i| \), between the price prediction, \( \hat{y}_i \), and the actual sales price, \( y_i \).
  3. Then, instead of predicting sales prices, you train a separate “uncertainty model”, \( F(\hat{y}_i, x_i) \), using these absolute deviations, \( |\hat{y}_i - y_i| \), as the target. This special model learns patterns indicating when the AVM predictions are typically accurate or uncertain.
  4. Finally, you apply this uncertainty model to estimate how uncertain the price predictions are for Homes A and B (i.e., your test set), by predicting their absolute price deviations. You now have simple uncertainty estimates for both homes (a code sketch of these four steps follows below).
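Putting these four steps together, here is a minimal sketch of the AVMU workflow using scikit-learn. The synthetic data, the feature names, and the choice of random forests are my own assumptions for illustration; the methodology itself is agnostic to the underlying regression algorithm:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Hypothetical training data: past sales with a few illustrative features
rng = np.random.default_rng(42)
n = 1_000
X = pd.DataFrame({
    "living_area": rng.uniform(80, 300, n),
    "bedrooms": rng.integers(1, 6, n),
    "age": rng.uniform(0, 60, n),
})
y = (2_000 * X["living_area"] + 15_000 * X["bedrooms"]
     - 1_000 * X["age"] + rng.normal(0, 40_000, n))  # synthetic sales prices

# Step 1: out-of-fold price predictions via 5-fold cross-validation
avm = RandomForestRegressor(n_estimators=200, random_state=0)
y_hat_oof = cross_val_predict(avm, X, y, cv=5)

# Step 2: absolute deviations |y_hat - y| become the uncertainty target
abs_dev = np.abs(y_hat_oof - y)

# Step 3: train the uncertainty model F(y_hat, x) on the absolute deviations
X_unc = X.assign(avm_prediction=y_hat_oof)
uncertainty_model = RandomForestRegressor(n_estimators=200, random_state=0)
uncertainty_model.fit(X_unc, abs_dev)

# Step 4: score the candidate homes with both the AVM and the AVMU model
avm.fit(X, y)  # refit the AVM on the full training data
homes_ab = pd.DataFrame({
    "living_area": [180.0, 185.0],
    "bedrooms": [4, 4],
    "age": [20.0, 35.0],
})
price_pred = avm.predict(homes_ab)
avmu = uncertainty_model.predict(homes_ab.assign(avm_prediction=price_pred))
print("Price predictions:", price_pred.round(0))
print("AVMU estimates:   ", avmu.round(0))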

Now, I know exactly what some of you might be thinking about the third step:

“Wait a second, you can’t just put a regression on top of another regression to explain why the first one is off!”

And you’d be absolutely right. Well, sort of. If there were clear, predictable data patterns showing that certain homes were consistently overpriced or underpriced by your AVM, that would mean your AVM wasn’t very good in the first place. Ideally, a good AVM should capture all meaningful patterns in the data. But here’s the clever twist: instead of predicting if a home is specifically overpriced or underpriced (what we call the signed deviation), we focus on absolute deviations. By doing this, we sidestep the issue of explaining if a home is valued too high or too low. Instead, we let the uncertainty model focus on identifying which types of homes the AVM tends to predict accurately and which ones it struggles with, no matter the direction of the error.

From a homebuyer’s perspective, you’re naturally more worried about overpaying. Imagine buying a home for $500,000 only to discover it’s actually worth just $400,000! But in practice, underestimating the value of a home is also more problematic than you’d think. Make an offer that’s too low, and you might just lose your dream home to another buyer. That’s why, as a savvy buyer equipped with AVM predictions, your goal isn’t just to chase the highest or lowest price prediction. Instead, your priority should be robust, reliable valuations that closely match the true market value. And thanks to the AVMU uncertainty estimates, you can now more confidently pinpoint exactly which predictions to trust.

Mathematically, the process described above can be written like this:

\[|\hat{y}_i - y_i| = F(\hat{y}_i, x_i) + \varepsilon_i \quad \text{for } 1 \leq i \leq n\]

and:

\[\text{AVMU}_i = F(\hat{y}_i, x_i)\]

The uncertainty model, \( F(\hat{y}_i, x_i) \), can be based on any regression algorithm (even the same one as your AVM). The difference is that for your uncertainty model you’re not necessarily interested in achieving perfect predictions of the absolute deviations. Instead, you’re interested in ranking the homes by prediction uncertainty, and thereby learning which of Home A’s and Home B’s price predictions you can trust the most. The MSE loss function used for the AVM (see the first equation) might therefore not be the ideal choice.

Rather than using MSE, you therefore fit your uncertainty model, \( F(\hat{y}_i, x_i) \), to optimize an objective better suited for ranking. One example is to maximize rank correlation (i.e., Spearman’s \( \rho \)), given by:

\[\rho = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}\]

Here, a higher \( \rho \) means your model ranks homes better regarding prediction uncertainty. \( D_i \) represents the difference in ranks between actual absolute deviations, \( |\hat{y}_i - y_i| \), and predicted uncertainties, \( \text{AVMU}_i = F(\hat{y}_i, x_i) \), for home \( i \).
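For example, once you have held-out absolute deviations and the corresponding AVMU estimates, Spearman’s \( \rho \) can be computed directly with scipy (the arrays below are placeholder values for illustration):

import numpy as np
from scipy.stats import spearmanr

# Placeholder values: actual absolute deviations vs. predicted AVMU estimates
actual_abs_dev = np.array([12_000, 55_000, 8_000, 31_000, 70_000])
predicted_avmu = np.array([15_000, 48_000, 10_000, 35_000, 60_000])

rho, p_value = spearmanr(actual_abs_dev, predicted_avmu)
print(f"Spearman's rho: {rho:.2f}")  # 1.00 here, since the two rankings agree perfectly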

Image by Author, made partly with DALL-E.

So now you have, for both candidate homes, an AVM price prediction and a corresponding AVMU uncertainty estimate. By combining these two measures, you quickly notice something interesting: even if multiple homes share the same “most likely market value”, the reliability of those predictions can vary greatly. In your case, you see that Home B comes with a significantly higher AVMU uncertainty estimate, signaling that its actual market value could stray far from the $500,000 valuation.

To protect yourself from unnecessary risk, you wisely opt for purchasing Home A, whose AVM valuation of $500,000 is backed by stronger certainty. With confidence restored thanks to AVMU, you happily finalize your purchase, knowing you’ve made a smart, data-informed choice, and celebrate your new home with a relaxing drink in your new front yard.

Image by Author, made partly with DALL-E.

Ethics and Other Applications of AVMU

This simple introduction to AVM price uncertainty and how AVMU can guide you when buying a home is just one of its many potential applications. Homes aren’t the only assets that could benefit from quick, low-cost valuation tools. While AVMs are commonly associated with housing due to plentiful data and easily identifiable characteristics, these models, and their uncertainty quantification via AVMU, can apply to virtually anything with a market price. Think about used cars, collectibles, or even pro soccer players. As long as there’s uncertainty in predicting their prices, AVMU can be used to understand it.

Sticking with housing, purchasing decisions aren’t the only area where AVMU could be used. Mortgage lenders frequently use AVMs to estimate the collateral value of properties, yet often overlook how uneven the accuracy of these price predictions can be. Similarly, tax authorities can use AVMs to determine your property taxes but may accidentally set unfair valuations due to unacknowledged uncertainty. Recognizing uncertainty through AVMU can help make these valuations fairer and more accurate across the board.

However, despite its versatility, it’s essential to remember that AVMU isn’t perfect either. It’s still a statistical model relying on data quality and quantity. No model can completely eliminate uncertainty, especially the random aspects inherent in most markets, sometimes referred to as aleatoric or irreducible uncertainty. Imagine a newlywed couple falling head-over-heels for a particular kitchen, prompting them to bid way above the typical market value. Or perhaps bad weather negatively influencing someone’s perception of a house during a viewing. Such unpredictable scenarios will always exist, and AVMU can’t account for every outlier.

Remember, AVMU gives you probabilities, not fixed truths. A home with a higher AVMU uncertainty estimate is more likely to see large price deviations, but it’s not a guarantee. And if you find yourself thinking, “should I make a third model to predict the uncertainty of my uncertainty model?”, it’s probably time to accept that some uncertainty is simply unavoidable. So, armed with your AVMU-informed insights, relax, embrace the uncertainty, and enjoy your new home!

References

  • A. J. Pollestad, A. B. Næss and A. Oust, Towards a Better Uncertainty Quantification in Automated Valuation Models (2024), The Journal of Real Estate Finance and Economics.
  • A. J. Pollestad and A. Oust, Harnessing uncertainty: a new approach to real estate investment decision support (2025), Quantitative Finance.

A Simple Implementation of the Attention Mechanism from Scratch
https://towardsdatascience.com/a-simple-implementation-of-the-attention-mechanism-from-scratch/

How attention helped models like RNNs mitigate the vanishing gradient problem and capture long-range dependencies among words

Introduction

The Attention Mechanism is often associated with the transformer architecture, but it was already used in RNNs. In Machine Translation (MT) tasks (e.g., English-Italian), when you want to predict the next Italian word, you need your model to focus on, or pay attention to, the most important English words that are useful for making a good translation.

Attention in RNNs

I will not go into the details of RNNs, but attention helped these models mitigate the vanishing gradient problem and capture longer-range dependencies among words.

At a certain point, we understood that the only important thing was the attention mechanism, and the entire RNN architecture was overkill. Hence, Attention is All You Need!

Self-Attention in Transformers

Classical attention indicates where words in the output sequence should focus attention in relation to the words in the input sequence. This is important in sequence-to-sequence tasks like MT.

Self-attention is a specific type of attention. It operates between any two elements in the same sequence and provides information on how “correlated” the words in the same sentence are.

For a given token (or word) in a sequence, self-attention generates a list of attention weights corresponding to all the tokens in the sequence. This process is applied to each token in the sentence, obtaining a matrix of attention weights.

This is the general idea. In practice, things are a bit more complicated because we want to add many learnable parameters to our neural network; let’s see how.

K, V, Q representations

Our model input is a sentence like “my name is Marcello Politi”. With the process of tokenization, a sentence is converted into a list of numbers like [2, 6, 8, 3, 1].

Before feeding the sentence into the transformer we need to create a dense representation for each token.

How to create this representation? We multiply each token by a matrix. The matrix is learned during training.

Let’s add some complexity now.

For each token, we create 3 vectors instead of one; we call these vectors key, value and query. (We’ll see later how we create these 3 vectors.)

Conceptually, these 3 vectors have a particular meaning:

  • The key vector represents the core information captured by the token
  • The value vector captures the full information of a token
  • The query vector is a question about the token’s relevance for the current task.

So the idea is that we focus on a particular token i, and we want to ask how important the other tokens in the sentence are for the token i we are considering.

This means that we take the query vector q_i of token i (we ask a question regarding i), and we do some mathematical operations with the key vectors k_j of all the tokens in the sentence. It’s like asking, at first glance, which tokens in the sequence look really important for understanding the meaning of token i.

What is this magical mathematical operation?

We multiply (dot product) the query vector by each key vector k_j and divide by a scaling factor.

In this way, we obtain a score for each pair (q_i, k_j). We turn this list of scores into a probability distribution by applying a softmax. Great, now we have obtained the attention weights!

With the attention weights, we know how important each token k_j is for understanding token i. So now we multiply the value vector v_j associated with each token by its weight and sum the vectors. In this way, we obtain the final context-aware vector of token_i.

If we are computing the contextual dense vector of token_1 we calculate:

z1 = a11*v1 + a12*v2 + … + a15*v5

Where the a_1j are the computed attention weights and the v_j are the value vectors.

Done! Almost…

I didn’t cover how we obtained the vectors k, v and q of each token. We need to define some matrices w_k, w_v and w_q so that when we multiply:

  • token * w_k -> k
  • token * w_q -> q
  • token * w_v -> v

These 3 matrices are initialized at random and learned during training; this is one reason why modern models such as LLMs have so many parameters.
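As an aside, one common way to parameterize these projections in practice (an assumption on my part, not something the hands-on section below depends on) is with bias-free linear layers, which also makes the parameter count easy to see:

import torch.nn as nn

d_model = 16  # toy embedding size used in this post; real LLMs use thousands
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

n_params = sum(p.numel() for layer in (w_q, w_k, w_v) for p in layer.parameters())
print(n_params)  # 3 * 16 * 16 = 768 learnable weights for a single attention block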

Multi-head Self-Attention in Transformers (MHSA)

Are we sure that the previous self-attention mechanism is able to capture all important relationships among tokens (words) and create dense vectors of those tokens that really make sense?

It might not always work perfectly. What if, to mitigate the error, we ran the entire thing twice with new w_q, w_k and w_v matrices and somehow merged the two resulting dense vectors? That way, one self-attention pass might capture some relationships and the other might capture different ones.

Well, this is exactly what happens in MHSA. The case we just discussed has two heads because it has two sets of w_q, w_k and w_v matrices. We can have even more heads: 4, 8, 16, etc.

The only complicated thing is that all these heads are managed in parallel: we process them all in the same computation using tensors.

The way we merge the dense vectors of each head is simple: we concatenate them (hence each head’s vectors must have a smaller dimension, so that when we concatenate them we get back the original dimension we wanted) and pass the resulting vector through another learnable matrix, w_o.

Hands-on

Python">import torch

Suppose you have a sentence. After tokenization, each token (word for simplicity) corresponds to an index (number):

tokenized_sentence = torch.tensor([
    2, #my
    6, #name
    8, #is
    3, #marcello
    1  #politi
])
tokenized_sentence

Before feeding the sentence into the transformer, we need to create a dense representation for each token.

How do we create this representation? We multiply each token by a matrix. This matrix is learned during training.

Let’s build this embedding matrix.

torch.manual_seed(0) # set a fixed seed for reproducibility
embed = torch.nn.Embedding(10, 16)

If we pass our tokenized sentence through the embedding layer, we obtain a dense representation of dimension 16 for each token:

sentence_embed = embed(tokenized_sentence).detach()
sentence_embed

In order to use the attention mechanism, we need to define 3 new matrices: w_q, w_k and w_v. When we multiply an input token by w_q we obtain the vector q; the same goes for w_k and w_v.

d = sentence_embed.shape[1] # embedding dimension (16); our matrices will have shape (16, 16)

w_key = torch.rand(d,d)
w_query = torch.rand(d,d)
w_value = torch.rand(d,d)

Compute attention weights

Let’s now compute the attention weights for only the first input token of the sentence.

token1_embed = sentence_embed[0]

# compute the three vectors associated with token1: q, k, v
key_1 = w_key.matmul(token1_embed)
query_1 = w_query.matmul(token1_embed)
value_1 = w_value.matmul(token1_embed)

print("key vector for token1: \n", key_1)   
print("query vector for token1: \n", query_1)
print("value vector for token1: \n", value_1)

We need to multiply the query vector associated with token1 (query_1) with the key vectors of all the tokens.

So now we need to compute all the keys (key_2, key_3, key_4, key_5). But wait, we can compute all of these in one shot by multiplying sentence_embed by the w_key matrix.

keys = sentence_embed.matmul(w_key.T)
keys[0] #contains the key vector of the first token and so on

Let’s do the same thing with the values

values = sentence_embed.matmul(w_value.T)
values[0] #contains the value vector of the first token and so on

Let’s compute the first part of the attention formula.

import torch.nn.functional as F
# attention weights of the first token with respect to all tokens
a1 = F.softmax(query_1.matmul(keys.T)/d**0.5, dim = 0)
a1

With the attention weights, we know how important each token is. So now we multiply the value vector associated with each token by its weight and sum, obtaining the final context-aware vector of token_1.

z1 = a1.matmul(values)
z1

In the same way, we could compute the context-aware dense vectors of all the other tokens (the sketch below does this for the whole sentence at once). Note that we are always using the same matrices w_k, w_q, w_v: we say that we use one head.
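Here is a minimal sketch of that vectorized computation, reusing the sentence_embed, w_query, w_key, w_value tensors and the scaling factor d defined above:

# Vectorized single-head self-attention for the whole sentence
queries = sentence_embed.matmul(w_query.T)           # [5, 16]
keys = sentence_embed.matmul(w_key.T)                # [5, 16]
values = sentence_embed.matmul(w_value.T)            # [5, 16]

scores = queries.matmul(keys.T) / d**0.5             # [5, 5]
attention_weights = F.softmax(scores, dim=-1)        # each row sums to 1
context_vectors = attention_weights.matmul(values)   # [5, 16], one row per token

print(torch.allclose(context_vectors[0], z1))        # should print True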

But we can have multiple triplets of matrices, one triplet per head. That’s why it is called multi-head attention.

The dense vectors that each head outputs for an input token are, at the end, concatenated and linearly transformed to get the final dense vector.

Implementing Multi-Head Self-Attention

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0) # fixed seed for reproducibility

Same steps as before…

# Tokenized sentence (same as yours)
tokenized_sentence = torch.tensor([2, 6, 8, 3, 1])  # [my, name, is, marcello, politi]

# Embedding layer: vocab size = 10, embedding dim = 16
embed = nn.Embedding(10, 16)
sentence_embed = embed(tokenized_sentence).detach()  # Shape: [5, 16] (seq_len, embed_dim)

We’ll define a multi-head attention mechanism with h heads (let’s say 4 heads for this example). Each head will have its own w_q, w_k, and w_v matrices, and the outputs of the heads will be concatenated and passed through a final linear layer.

Since the outputs of the heads will be concatenated, and we want a final dimension of d, the dimension of each head needs to be d/h. Additionally, each concatenated vector will go through a linear transformation, so we need another matrix, w_output.

d = sentence_embed.shape[1]  # embed dimension 16
h = 4  # Number of heads
d_k = d // h  # Dimension per head (16 / 4 = 4)

Since we have 4 heads, we want 4 copies of each matrix. Instead of separate copies, we add a dimension, which is equivalent but lets us do everything in one operation (imagine stacking the matrices on top of each other).

# Define weight matrices for each head
w_query = torch.rand(h, d, d_k)  # Shape: [4, 16, 4] (one d x d_k matrix per head)
w_key = torch.rand(h, d, d_k)    # Shape: [4, 16, 4]
w_value = torch.rand(h, d, d_k)  # Shape: [4, 16, 4]
w_output = torch.rand(d, d)  # Final linear layer: [16, 16]

For simplicity, I’m using torch’s einsum. If you’re not familiar with it, check out my blog post.

The einsum operation torch.einsum('sd,hde->hse', sentence_embed, w_query) in PyTorch uses letters to define how to multiply and rearrange numbers. Here’s what each part means:

  1. Input Tensors:
    • sentence_embed with the notation 'sd':
      • s represents the number of words (sequence length), which is 5.
      • d represents the number of numbers per word (embedding size), which is 16.
      • The shape of this tensor is [5, 16].
    • w_query with the notation 'hde':
      • h represents the number of heads, which is 4.
      • d represents the embedding size, which again is 16.
      • e represents the new number size per head (d_k), which is 4.
      • The shape of this tensor is [4, 16, 4].
  2. Output Tensor:
    • The output has the notation 'hse':
      • h represents 4 heads.
      • s represents 5 words.
      • e represents 4 numbers per head.
      • The shape of the output tensor is [4, 5, 4].

# Compute Q, K, V for all tokens and all heads
# sentence_embed: [5, 16] -> queries/keys/values: [4, 5, 4] (h, seq_len, d_k)
queries = torch.einsum('sd,hde->hse', sentence_embed, w_query)  # [h, seq_len, d_k]
keys = torch.einsum('sd,hde->hse', sentence_embed, w_key)       # [h, seq_len, d_k]
values = torch.einsum('sd,hde->hse', sentence_embed, w_value)   # [h, seq_len, d_k]

This einsum equation performs a dot product between the queries (hse) and the transposed keys (hek) to obtain scores of shape [h, seq_len, seq_len], where:

  • h -> Number of heads.
  • s and k -> Sequence length (number of tokens).
  • e -> Dimension of each head (d_k).

The division by (d_k ** 0.5) scales the scores to stabilize gradients. Softmax is then applied to obtain attention weights:

# Compute attention scores
scores = torch.einsum('hse,hek->hsk', queries, keys.transpose(-2, -1)) / (d_k ** 0.5)  # [4, 5, 5]
attention_weights = F.softmax(scores, dim=-1)  # [4, 5, 5]
# Apply attention weights
head_outputs = torch.einsum('hij,hjk->hik', attention_weights, values)  # [4, 5, 4]
head_outputs.shape

Now we concatenate the outputs of all heads, for every token:

# Concatenate heads
concat_heads = head_outputs.permute(1, 0, 2).reshape(sentence_embed.shape[0], -1)  # [5, 16]
concat_heads.shape

Finally, let’s multiply by the last matrix, w_output, to obtain the final multi-head attention output:

multihead_output = concat_heads.matmul(w_output)  # [5, 16] @ [16, 16] -> [5, 16]
print("Multi-head attention output for token1:\n", multihead_output[0])

Final Thoughts

In this blog post, I’ve implemented a simple version of the attention mechanism. This is not how it is really implemented in modern frameworks, but my aim is to provide some insight so that anyone can understand how it works. In future articles, I’ll go through the entire implementation of a transformer architecture.

Follow me on TDS if you like this article! 😁

💼 Linkedin | 🐦 X (Twitter) | 💻 Website


Unless otherwise noted, images are by the author
