
Welcome back to the fourth post of ‘Courage to Learn ML: Unraveling L1 & L2 Regularization.’ Last time, our mentor-learner pair explored the properties of L1 and L2 regularization through the lens of Lagrange multipliers.
In this concluding segment on L1 and L2 regularization, the duo will delve into these topics from a fresh angle – Bayesian priors. We’ll also summarize how L1 and L2 regularizations are applied across different algorithms.
In this article, we’ll address several intriguing questions. If any of these topics spark your curiosity, you’ve come to the right place!
- How MAP priors relate to L1 and L2 regularizations
- An intuitive breakdown of using Laplace and normal distributions as priors
- Understanding the sparsity induced by L1 regularization with a Laplace prior
- Algorithms that are compatible with L1 and L2 regularization
- Why L2 regularization is often referred to as ‘weight decay’ in neural network training
- The reasons behind the less frequent use of L1 norm in neural networks
So, we’ve talked about how MAP differs from MLE, mainly because MAP takes into account an extra piece of information: our beliefs before seeing the data, or the prior. How does this tie in with L1 and L2 regularizations?
Let’s dive into how different priors in the MAP formula shape our approach to L1 and L2 regularization (for a detailed walkthrough on formulating this equation, check out this post).
When considering priors for weights, our initial intuition often leads us to a normal distribution. With this choice, we typically place a zero-mean normal distribution on each weight wᵢ, sharing the same standard deviation σ. Plugging this belief into the prior term log p(w) in MAP (where p(w) represents the weight’s prior) naturally yields a sum of squared weights, which is precisely the L2 norm. In other words, using a normal distribution as our prior equates to applying L2 regularization.
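For concreteness, here is the short derivation (a sketch, with every term that doesn’t depend on w collected into a constant C):

$$\log p(w) = \sum_i \log \mathcal{N}(w_i \mid 0, \sigma^2) = -\frac{1}{2\sigma^2}\sum_i w_i^2 + C$$

Maximizing the posterior therefore amounts to minimizing the negative log-likelihood plus a penalty λ·Σᵢwᵢ², with λ = 1/(2σ²): a tighter prior (smaller σ) means stronger regularization.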

Conversely, adopting a Laplace distribution as our belief results in the L1 norm for weights. Hence, a Laplace prior essentially translates to L1 regularization.
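The same one-line check works here. A zero-mean Laplace prior with scale b has density (1/2b)·exp(−|w|/b), so:

$$\log p(w) = \sum_i \log \frac{1}{2b} \exp\!\left(-\frac{|w_i|}{b}\right) = -\frac{1}{b}\sum_i |w_i| + C$$

which is exactly an L1 penalty on the weights, with λ = 1/b.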

In short, L1 regularization aligns with a Laplace distribution prior, while L2 regularization corresponds to a Normal distribution prior.
Interestingly, when employing a uniform prior in the MAP framework, it essentially "disappears" from the equation (go ahead and try it yourself!). This leaves the likelihood term as the sole determinant of the optimal weight values, effectively transforming the MAP estimation into maximum likelihood estimation (MLE).
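If you do try it and want to check your work afterwards: with a uniform prior, p(w) is a constant c, so

$$\hat{w}_{\text{MAP}} = \arg\max_w \left[\log p(D \mid w) + \log c\right] = \arg\max_w \log p(D \mid w) = \hat{w}_{\text{MLE}}$$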
So, can you explain the reasoning for having different beliefs when our prior is a Laplace distribution versus a normal distribution? I’d like to visualize this better.
This is a great question. Indeed, having different priors means you hold various initial assumptions about the situation before collecting any data. We’ll delve into the purpose of different distributions later, but for now, let’s look at a simple, intuitive example using Laplace and normal distributions. Consider the number of views on my new Medium posts. Two weeks ago, as a new writer with no followers, I expected zero views. My assumption was that the average daily view count would start low, possibly at zero, but might increase as readers interested in similar topics discover my work. A Laplace prior fits this scenario well. It suggests a range of possible view counts but assigns higher probability to numbers near zero, reflecting my expectation of few views initially but allowing for growth over time.
Now, with 55 viewers (thanks, everyone!), and followers who receive updates on my posts, my expectations have changed. I anticipate that new posts will perform similarly to my previous ones, averaging around my historical view count. This is where a normal distribution prior comes into play, predicting future views based on my established track record.
Hmm… Can you explain the L1 regularization sparsity with a Laplace prior?
Indeed, L1 regularization’s promotion of sparsity can be illuminated by comparing the Laplace distribution to the normal distribution. The key difference lies in their probability densities around zero. The Laplace distribution is sharply peaked at zero, indicating a higher likelihood of values close to zero. This characteristic mirrors the effect of L1 regularization, where most weights in the model are driven towards zero, promoting sparsity. In contrast, the normal distribution, associated with L2 regularization, is less peaked at zero and more spread out, indicating a preference for distributing weights more evenly.

Additionally, the Laplace distribution has heavier tails than the normal distribution, meaning it extends further out. This property allows some weights to remain significantly far from zero while most others sit close to zero. So, by choosing the Laplace distribution as a prior for the weights (L1 regularization), we encourage the model to learn solutions where most weights are near zero, achieving sparsity without sacrificing potentially relevant features. This is why L1 regularization can be used as a feature selection method.
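If you’d like to check this numerically rather than take it on faith, here is a minimal sketch using scipy (matching the two distributions’ variances is my choice, just to keep the comparison fair):

```python
from scipy.stats import norm, laplace

# Give both distributions unit variance: Var(Laplace) = 2*b^2, so b = 1/sqrt(2).
b = 2 ** -0.5

# Density at zero: the Laplace peak is markedly higher.
print(norm.pdf(0), laplace.pdf(0, scale=b))  # ~0.399 vs ~0.707

# Density far out in the tail: the Laplace tail is heavier too.
print(norm.pdf(3), laplace.pdf(3, scale=b))  # ~0.0044 vs ~0.0102
```

More mass piled at zero and more mass far from zero: exactly the combination that lets L1 zero out most weights while leaving a few large ones alone.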
So, I see that L1 and L2 regularizations are key for avoiding overfitting and boosting a model’s generalizability. Can you tell me which algorithms these methods can be applied to?
L1 and L2 regularization can be applied to many algorithms by adding a penalty term to their loss functions. Here are some specific examples of algorithms where L1 and L2 regularization are applied (a short sketch of how these penalties are exposed in a few common libraries follows the list):
- Linear models. These penalties are particularly useful for high-dimensional problems. In linear models, they are known as lasso and ridge regression, respectively. One thing to note is that L1 regularization not only helps prevent overfitting but also performs feature selection, which in turn mitigates multicollinearity.
- SVM. Regularization is at the core of SVMs: the standard formulation minimizes the norm of the weights, which is equivalent to maximizing the margin between the decision boundary and the closest support vectors, and this wide margin is what gives SVMs their good generalization. The usual penalty is the L2 norm; an L1 penalty can be used instead to produce a sparse weight vector.
- Neural Networks. L2 regularization is more commonly used in neural networks and is often referred to as weight decay. L1 regularization can also be used in neural networks, but it is less common due to its tendency to lead to sparse weights.
- Ensemble algorithms. Gradient boosting implementations such as XGBoost use L1 and L2 regularization to limit the complexity of individual trees within the ensemble. Both penalties act on the leaf scores of each tree: the L1 term shrinks leaf weights toward zero, while the L2 term penalizes their squared magnitude (a separate term penalizes the number of leaves).
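Here is that sketch: a minimal look at where these penalties live in scikit-learn, XGBoost, and Keras (assuming those packages are installed; the hyperparameter values are illustrative only, not tuned recommendations):

```python
from sklearn.linear_model import Lasso, Ridge, LogisticRegression

lasso = Lasso(alpha=0.1)   # linear regression with an L1 penalty
ridge = Ridge(alpha=1.0)   # linear regression with an L2 penalty
# Sparse logistic regression: the liblinear solver supports the L1 penalty.
logit = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# XGBoost: reg_alpha is the L1 term and reg_lambda the L2 term on leaf scores.
from xgboost import XGBRegressor
booster = XGBRegressor(reg_alpha=0.5, reg_lambda=1.0)

# Keras: attach a regularizer directly to a layer's kernel (see the reference below).
from tensorflow import keras
dense = keras.layers.Dense(64, kernel_regularizer=keras.regularizers.l2(1e-4))
```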
Why is L2 regularization also called ‘weight decay’ in neural network training? And why is the L1 norm less commonly used in neural networks?
To tackle those two questions, let’s bring in a bit of math to illustrate how weights get updated in the presence of L1 and L2 regularizations.
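Using λ for the penalty coefficient and α for the learning rate (and folding the conventional factor of ½ in the L2 penalty into λ), the gradient-descent update rules become:

$$\text{L2:}\quad w \leftarrow w - \alpha\left(\frac{\partial L}{\partial w} + \lambda w\right) = (1-\alpha\lambda)\,w - \alpha\frac{\partial L}{\partial w}$$

$$\text{L1:}\quad w \leftarrow w - \alpha\left(\frac{\partial L}{\partial w} + \lambda\,\text{sign}(w)\right)$$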

In L2 regularization, the weight update involves a slight reduction of each weight, scaled according to its own magnitude. This is what produces ‘weight decay’: each weight is decreased by an amount directly proportional to its current value. This proportional reduction, governed by the typically small penalty coefficient (λ) and learning rate (α), means larger weights receive a larger absolute penalty than smaller weights. The essence of weight decay lies in this scaling down of weights, encouraging the model to keep its weights small. Such behavior is advantageous in neural networks because it tends to produce smoother decision boundaries.

In contrast, L1 regularization modifies the weight update rule by subtracting or adding a constant amount, determined by αλ and the sign of the weight (w). This approach pushes weights towards zero, regardless of whether they are positive or negative. Under L1 regularization, all weights, irrespective of their magnitude, are adjusted by the same fixed amount. This results in larger weights remaining relatively large, while smaller weights are more rapidly driven to zero, promoting sparsity in the network.
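To see the two behaviors side by side, here is a tiny numerical sketch that applies only the penalty part of each update (the α and λ values are arbitrary, chosen just to make the effect visible):

```python
import numpy as np

alpha, lam = 0.1, 0.5
w = np.array([4.0, 0.2, -0.05])

# L2 / weight decay: every weight is scaled by the same factor (1 - alpha*lam),
# so each one shrinks in proportion to its own size.
w_l2 = (1 - alpha * lam) * w
print(w_l2)  # ~[ 3.8    0.19  -0.0475]

# L1: every weight moves toward zero by the same fixed step alpha*lam.
# (Practical implementations soft-threshold, so small weights stop exactly at
# zero instead of oscillating around it.)
w_l1 = w - alpha * lam * np.sign(w)
print(w_l1)  # ~[ 3.95  0.15  0.  ]
```

The largest weight barely feels the L1 step, while the smallest is driven exactly to zero, matching the sparsity story above.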

Comparing the two, L2’s adjustment is based on the weight’s existing value: every weight is multiplied by the same factor (1 − αλ), so larger weights shrink by larger absolute amounts. This steady multiplicative shrinkage is why it’s termed ‘weight decay’. On the other hand, L1’s fixed adjustment amount, regardless of weight size, can lead to some issues that make it less favorable in neural networks:
- It can zero out some weights, causing ‘dead neurons’ and potentially disrupting information flow within the network, which could impair model performance.
- The L1 penalty is non-differentiable at zero, which makes plain gradient-based optimizers like gradient descent less effective around that point.
What effects do adding L1 and L2 regularization have on our loss function? Does incorporating these regularizations lead us away from the original global minimum?
It’s a great question! In short, once we incorporate regularization, we intentionally shift our focus away from the original global minimum. This means adding penalty terms to the loss function, fundamentally changing its landscape. It’s crucial to understand that this change is desirable, not accidental.
By introducing these penalties, we aim to achieve a new optimal solution that balances two crucial goals: fitting the training data well to minimize empirical risk while simultaneously reducing model complexity and enhancing generalization to unseen data. The original global minimum might not achieve this balance, potentially leading to overfitting and poor performance on new data.
If you’re interested in the mathematical details of measuring the distance between the original and regularized optima, I highly recommend chapter 7 (pages 224–229) of Deep Learning by Ian Goodfellow. Pay particular attention to formulas 7.7 and 7.13 for L2 and 7.22 and 7.23 for L1. This provides a quantifiable assessment of the impact regularization terms have on weights, deepening your understanding of L1 and L2 regularization.
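For readers without the book at hand, the headline results of that section (paraphrasing from memory here, so do verify against the text; α below denotes the regularization strength, following the book’s notation): approximating the unregularized loss quadratically around its optimum w* with Hessian H = QΛQ⊤, the L2-regularized optimum is

$$\tilde{w} = Q\,(\Lambda + \alpha I)^{-1}\Lambda\, Q^{\top} w^{*}$$

so each component of w* is rescaled by λᵢ/(λᵢ + α) along the Hessian’s eigenvectors, while for L1 (under a diagonal-Hessian assumption) each weight is soft-thresholded:

$$w_i = \text{sign}(w_i^{*})\,\max\left\{|w_i^{*}| - \frac{\alpha}{H_{i,i}},\ 0\right\}$$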
We’ve now reached the conclusion of our exploration into L1 and L2 regularization. In our next discussion, I’m excited to delve into the basics of loss functions. A big thank you to all the readers who enjoyed the first part of this series. Initially, my goal was to solidify my grasp of basic ML concepts, but I’m thrilled to see it resonate with many of you 😃 . If you have suggestions for our next topic, please feel free to leave a comment!
Other posts in this series:
- Courage to Learn ML: Demystifying L1 & L2 Regularization (part 1)
- Courage to Learn ML: Demystifying L1 & L2 Regularization (part 2)
- Courage to Learn ML: Demystifying L1 & L2 Regularization (part 3)
- Courage to Learn ML: Decoding Likelihood, MLE, and MAP
If you liked the article, you can find me on LinkedIn.
Reference
https://keras.io/api/layers/regularizers/