Are You Sure Your Posterior Makes Sense?
https://towardsdatascience.com/are-you-sure-your-posterior-makes-sense/
A detailed guide on how to use diagnostics to evaluate the performance of MCMC samplers

This article is co-authored by Felipe Bandeira, Giselle Fretta, Thu Than, and Elbion Redenica. We also thank Prof. Carl Scheffler for his support.

Introduction

Parameter estimation has been one of the most important topics in statistics for decades. While frequentist approaches, such as Maximum Likelihood Estimation, used to be the gold standard, the advance of computation has opened space for Bayesian methods. Estimating posterior distributions with MCMC samplers has become increasingly common, but reliable inference depends on a task that is far from trivial: making sure that the sampler — and the processes it executes under the hood — worked as expected. It is worth keeping in mind what Lewis Carroll once wrote: “If you don’t know where you’re going, any road will take you there.”

This article is meant to help data scientists evaluate an often overlooked aspect of Bayesian parameter estimation: the reliability of the sampling process. Throughout the sections, we combine simple analogies with technical rigor to ensure our explanations are accessible to data scientists with any level of familiarity with Bayesian methods. Although our implementations are in Python with PyMC, the concepts we cover are useful to anyone using an MCMC algorithm, from Metropolis-Hastings to NUTS. 

Key Concepts

No data scientist or statistician would disagree with the importance of robust parameter estimation methods. Whether the objective is to make inferences or conduct simulations, having the capacity to model the data generation process is crucial. For a long time, estimation was mainly performed using frequentist tools, such as Maximum Likelihood Estimation (MLE) or even the famous Least Squares optimization used in regressions. Yet, frequentist methods have clear shortcomings, such as the fact that they are focused on point estimates and do not incorporate prior knowledge that could improve estimates.

As an alternative to these tools, Bayesian methods have gained popularity over the past decades. They provide statisticians not only with point estimates of the unknown parameter but also with credible intervals for it, all of which are informed by the data and by the prior knowledge researchers hold. Originally, Bayesian parameter estimation was done through an adapted version of Bayes’ theorem focused on unknown parameters (represented as θ) and known data points (represented as x). We can define P(θ|x), the posterior distribution of a parameter’s value given the data, as:

\[ P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)} \]

In this formula, P(x|θ) is the likelihood of the data given a parameter value, P(θ) is the prior distribution over the parameter, and P(x) is the evidence, which is computed by integrating the joint distribution over all possible parameter values:

\[ P(x) = \int_\theta P(x, \theta) d\theta \]

In some cases, due to the complexity of the calculations required, deriving the posterior distribution analytically was not possible. However, with the advance of computation, running sampling algorithms (especially MCMC ones) to estimate posterior distributions has become easier, giving researchers a powerful tool for situations where analytical posteriors are not trivial to find. Yet, with such power also comes a large amount of responsibility to ensure that results make sense. This is where sampler diagnostics come in, offering a set of valuable tools to gauge 1) whether an MCMC algorithm is working well and, consequently, 2) whether the estimated distribution we see is an accurate representation of the real posterior distribution. But how can we know that?

How samplers work

Before diving into the technicalities of diagnostics, we shall cover how the process of sampling a posterior (especially with an MCMC sampler) works. In simple terms, we can think of a posterior distribution as a geographical area we haven’t been to but need to know the topography of. How can we draw an accurate map of the region?  

One of our favorite analogies comes from Ben Gilbert. Suppose that the unknown region is actually a house whose floorplan we wish to map. For some reason, we cannot directly visit the house, but we can send bees inside with GPS devices attached to them. If everything works as expected, the bees will fly around the house, and using their trajectories, we can estimate what the floor plan looks like. In this analogy, the floor plan is the posterior distribution, and the sampler is the group of bees flying around the house.

The reason we are writing this article is that, in some cases, the bees won’t fly as expected. If they get stuck in a certain room for some reason (because someone dropped sugar on the floor, for example), the data they return won’t be representative of the entire house; rather than visiting all rooms, the bees only visited a few, and our picture of what the house looks like will ultimately be incomplete. Similarly, when a sampler does not work correctly, our estimation of the posterior distribution is also incomplete, and any inference we draw based on it is likely to be wrong.

Markov Chain Monte Carlo (MCMC)

In technical terms, we call an MCMC process any algorithm that undergoes transitions from one state to another with certain properties. Markov Chain refers to the fact that the next state only depends on the current one (or that the bee’s next location is only influenced by its current place, and not by all of the places where it has been before). Monte Carlo means that the next state is chosen randomly. MCMC methods like Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), and No-U-Turn Sampler (NUTS) all operate by constructing Markov Chains (a sequence of steps) that are close to random and gradually explore the posterior distribution.

Now that you understand how a sampler works, let’s dive into a practical scenario to help us explore sampling problems.

Case Study

Imagine that, in a faraway nation, a governor wants to understand more about public annual spending on healthcare by mayors of cities with less than 1 million inhabitants. Rather than looking at sheer frequencies, he wants to understand the underlying distribution explaining expenditure, and a sample of spending data is about to arrive. The problem is that two of the economists involved in the project disagree about how the model should look.

Model 1

The first economist believes that all cities spend similarly, with some variation around a certain mean. As such, he creates a simple model. Although the specifics of how the economist chose his priors are irrelevant to us, we do need to keep in mind that he is trying to approximate a Normal (unimodal) distribution.

\[
x_i \sim \text{Normal}(\mu, \sigma^2) \text{ i.i.d. for all } i \\
\mu \sim \text{Normal}(10, 2) \\
\sigma^2 \sim \text{Uniform}(0,5)
\]
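For readers who want to follow along, here is a minimal PyMC sketch of this first model. The economists’ actual spending data is not shown, so we simulate a stand-in sample, and we treat the second argument of Normal(10, 2) as a standard deviation; the variable names simply echo the ‘m’ and ‘s’ labels that appear in the figures below.

import numpy as np
import pymc as pm

# Simulated stand-in for the spending sample (purely illustrative)
rng = np.random.default_rng(42)
data = rng.normal(loc=10, scale=1.5, size=200)

with pm.Model() as model_1:
    m = pm.Normal("m", mu=10, sigma=2)       # prior on the mean spending
    s = pm.Uniform("s", lower=0, upper=5)    # prior on the variance
    pm.Normal("x", mu=m, sigma=pm.math.sqrt(s), observed=data)
    idata_1 = pm.sample(draws=2000, chains=4, random_seed=42)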

Model 2

The second economist disagrees, arguing that spending is more complex than his colleague believes. He believes that, given ideological differences and budget constraints, there are two kinds of cities: the ones that do their best to spend very little and the ones that are not afraid of spending a lot. As such, he creates a slightly more complex model, using a mixture of normals to reflect his belief that the true distribution is bimodal.

\[
x_i \sim \text{Normal-Mixture}([\omega, 1-\omega], [m_1, m_2], [s_1^2, s_2^2]) \text{ i.i.d. for all } i\\
m_j \sim \text{Normal}(2.3, 0.5^2) \text{ for } j = 1,2 \\
s_j^2 \sim \text{Inverse-Gamma}(1,1) \text{ for } j=1,2 \\
\omega \sim \text{Beta}(1,1)
\]
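A matching PyMC sketch for the mixture model could look like the following; again, the data is a simulated placeholder, and the variable names (m, s_squared, w) mirror the parameters referenced in the figures later on.

import numpy as np
import pymc as pm

# Same illustrative stand-in sample as before
rng = np.random.default_rng(42)
data = rng.normal(loc=10, scale=1.5, size=200)

with pm.Model() as model_2:
    w = pm.Beta("w", alpha=1, beta=1)                                   # mixture weight
    m = pm.Normal("m", mu=2.3, sigma=0.5, shape=2)                      # component means
    s_squared = pm.InverseGamma("s_squared", alpha=1, beta=1, shape=2)  # component variances
    pm.NormalMixture("x", w=pm.math.stack([w, 1 - w]), mu=m,
                     sigma=pm.math.sqrt(s_squared), observed=data)
    idata_2 = pm.sample(draws=2000, chains=4, random_seed=42)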

After the data arrives, each economist runs an MCMC algorithm to estimate their desired posteriors, which will be a reflection of reality (1) if their assumptions are true and (2) if the sampler worked correctly. The first if, a discussion about assumptions, shall be left to the economists. However, how can they know whether the second if holds? In other words, how can they be sure that the sampler worked correctly and, as a consequence, their posterior estimations are unbiased?

Sampler Diagnostics

To evaluate a sampler’s performance, we can explore a small set of metrics that reflect different parts of the estimation process.

Quantitative Metrics

R-hat (Potential Scale Reduction Factor)

In simple terms, R-hat evaluates whether bees that started at different places have all explored the same rooms at the end of the day. To estimate the posterior, an MCMC algorithm uses multiple chains (or bees) that start at random locations. R-hat is the metric we use to assess the convergence of the chains. It measures whether multiple MCMC chains have mixed well (i.e., if they have sampled the same topography) by comparing the variance of samples within each chain to the variance of the sample means across chains. Intuitively, this means that

\[
\hat{R} = \sqrt{\frac{\text{Variance Between Chains}}{\text{Variance Within Chains}}}
\]

If R-hat is close to 1.0 (or below 1.01), it means that the variance within each chain is very similar to the variance between chains, suggesting that they have converged to the same distribution. In other words, the chains are behaving similarly and are also indistinguishable from one another. This is precisely what we see after sampling the posterior of the first model, shown in the last column of the table below:

Figure 1. Summary statistics of the sampler highlighting ideal R-hats.

The R-hat from the second model, however, tells a different story. The fact that we have such large R-hat values indicates that, at the end of the sampling process, the different chains had not converged yet. In practice, this means that the distributions they explored and returned were different, or that each bee created a map of a different room of the house. This fundamentally leaves us without a clue of how the pieces connect or what the complete floor plan looks like.

Figure 2. Summary statistics of the sampler showcasing problematic R-hats.

Given our R-hat readouts were large, we know something went wrong with the sampling process in the second model. However, even if the R-hat had turned out within acceptable levels, this does not give us certainty that the sampling process worked. R-hat is just a diagnostic tool, not a guarantee. Sometimes, even if your R-hat readout is lower than 1.01, the sampler might not have properly explored the full posterior. This happens when multiple bees start their exploration in the same room and remain there. Likewise, if you’re using a small number of chains, and if your posterior happens to be multimodal, there is a probability that all chains started in the same mode and failed to explore other peaks. 

The R-hat readout reflects convergence, not completion. In order to have a more comprehensive idea, we need to check other diagnostic metrics as well.
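If you are sampling with PyMC, ArviZ computes R-hat directly from the returned InferenceData; a quick check on the idata objects from the sketches above might look like this:

import arviz as az

# R-hat per parameter: values at or below ~1.01 suggest the chains have mixed well
print(az.rhat(idata_1))
print(az.rhat(idata_2))

# az.summary reports r_hat alongside the other diagnostics discussed below
print(az.summary(idata_2, var_names=["m", "s_squared", "w"]))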

Effective Sample Size (ESS)

When explaining what MCMC was, we mentioned that “Monte Carlo” refers to the fact that the next state is chosen randomly. This does not necessarily mean that the states are fully independent. Even though the bees choose their next step at random, these steps are still correlated to some extent. If a bee is exploring a living room at time t=0, it will probably still be in the living room at time t=1, even though it is in a different part of the same room. Due to this natural connection between samples, we say these two data points are autocorrelated.

Due to their nature, MCMC methods inherently produce autocorrelated samples, which complicates statistical analysis and requires careful evaluation. In statistical inference, we often assume independent samples to ensure that the estimates of uncertainty are accurate, hence the need for uncorrelated samples. If two data points are too similar to each other, the correlation reduces their effective information content. Mathematically, the formula below represents the autocorrelation function between two time points (t1 and t2) in a random process:

\[
R_{XX}(t_1, t_2) = E[X_{t_1} \overline{X_{t_2}}]
\]

where E is the expected value operator and X-bar is the complex conjugate. In MCMC sampling, this is crucial because high autocorrelation means that new samples don’t teach us anything different from the old ones, effectively reducing the sample size we have. Unsurprisingly, the metric that reflects this is called Effective Sample Size (ESS), and it helps us determine how many truly independent samples we have. 

As hinted previously, the effective sample size accounts for autocorrelation by estimating how many truly independent samples would provide the same information as the autocorrelated samples we have. Mathematically, for a parameter θ, the ESS is defined as:

\[
ESS = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho(\theta)_k}
\]

where n is the total number of samples and ρ(θ)k is the autocorrelation at lag k for parameter θ.

Typically, for ESS readouts, the higher, the better. This is what we see in the readout for the first model. Two common ESS variations are Bulk-ESS, which assesses mixing in the central part of the distribution, and Tail-ESS, which focuses on the efficiency of sampling the distribution’s tails. Both inform us if our model accurately reflects the central tendency and credible intervals.

Figure 3. Summary statistics of the sampler highlighting ideal quantities for ESS bulk and tail.

In contrast, the readouts for the second model are very bad. Typically, we want to see readouts that are at least 1/10 of the total sample size. In this case, given each chain sampled 2000 observations, we should expect ESS readouts of at least 800 (from the total size of 8000 samples across 4 chains of 2000 samples each), which is not what we observe.

Figure 4. Summary statistics of the sampler demonstrating problematic ESS bulk and tail.
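As with R-hat, the bulk and tail ESS can be pulled programmatically; assuming the idata objects from the earlier sketches, a quick ArviZ check is:

import arviz as az

# Effective sample size for the bulk and the tails of the posterior
print(az.ess(idata_1, method="bulk"))
print(az.ess(idata_1, method="tail"))

# Rule of thumb from the text: with 4 chains of 2,000 draws each (8,000 samples total),
# look for ESS readouts of at least ~800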

Visual Diagnostics

Apart from the numerical metrics, our understanding of sampler performance can be deepened through the use of diagnostic plots. The main ones are rank plots, trace plots, and pair plots.

Rank Plots

A rank plot helps us identify whether the different chains have explored all of the posterior distribution. If we once again think of the bee analogy, rank plots tell us which bees explored which parts of the house. Therefore, to evaluate whether the posterior was explored equally by all chains, we observe the shape of the rank plots produced by the sampler. Ideally, we want the distribution of all chains to look roughly uniform, like in the rank plots generated after sampling the first model. Each color below represents a chain (or bee):

Figure 5. Rank plots for parameters ‘m’ and ‘s’ across four MCMC chains. Each bar represents the distribution of rank values for one chain, with ideally uniform ranks indicating good mixing and proper convergence.

Under the hood, a rank plot is produced with a simple sequence of steps. First, we run the sampler and let it sample from the posterior of each parameter. In our case, we are sampling posteriors for parameters m and s of the first model. Then, parameter by parameter, we get all samples from all chains, put them together, and order them from smallest to largest. We then ask ourselves, for each sample, what was the chain where it came from? This will allow us to create plots like the ones we see above. 
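In practice you rarely build rank plots by hand; ArviZ wraps the procedure above in a single call (again assuming the idata objects from the earlier sketches):

import arviz as az

# Roughly uniform bars across chains indicate good mixing
az.plot_rank(idata_1, var_names=["m", "s"])
az.plot_rank(idata_2, var_names=["m", "s_squared", "w"])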

In contrast, bad rank plots are easy to spot. Unlike the previous example, the distributions from the second model, shown below, are not uniform. From the plots, what we interpret is that each chain, after beginning at different random locations, got stuck in a region and did not explore the entirety of the posterior. Consequently, we cannot make inferences from the results, as they are unreliable and not representative of the true posterior distribution. This would be equivalent to having four bees that started at different rooms of the house and got stuck somewhere during their exploration, never covering the entirety of the property.

Figure 6. Rank plots for parameters m, s_squared, and w across four MCMC chains. Each subplot shows the distribution of ranks by chain. There are noticeable deviations from uniformity (e.g., stair-step patterns or imbalances across chains) suggesting potential sampling issues.

KDE and Trace Plots

Similar to R-hat, trace plots help us assess the convergence of MCMC samples by visualizing how the algorithm explores the parameter space over time. PyMC provides two types of trace plots to diagnose mixing issues: Kernel Density Estimate (KDE) plots and iteration-based trace plots. Each of these serves a distinct purpose in evaluating whether the sampler has properly explored the target distribution.

The KDE plot (usually on the left) estimates the posterior density for each chain, where each line represents a separate chain. This allows us to check whether all chains have converged to the same distribution. If the KDEs overlap, it suggests that the chains are sampling from the same posterior and that mixing has occurred. On the other hand, the trace plot (usually on the right) visualizes how parameter values change over MCMC iterations (steps), with each line representing a different chain. A well-mixed sampler will produce trace plots that look noisy and random, with no clear structure or separation between chains.
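In ArviZ/PyMC, both panels come from a single call; a minimal example, assuming the idata objects from the earlier sketches, is:

import arviz as az

# Left column: per-chain KDEs of the marginal posterior; right column: sampled values per iteration
az.plot_trace(idata_1, var_names=["m", "s"])
az.plot_trace(idata_2, var_names=["m", "s_squared", "w"])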

Using the bee analogy, trace plots can be thought of as snapshots of the “features” of the house at different locations. If the sampler is working correctly, the KDEs in the left plot should align closely, showing that all bees (chains) have explored the house similarly. Meanwhile, the right plot should show highly variable traces that blend together, confirming that the chains are actively moving through the space rather than getting stuck in specific regions.

Figure 7. Density and trace plots for parameters m and s from the first model across four MCMC chains. The left panel shows kernel density estimates (KDE) of the marginal posterior distribution for each chain, indicating consistent central tendency and spread. The right panel displays the trace plot over iterations, with overlapping chains and no apparent divergences, suggesting good mixing and convergence.

However, if your sampler has poor mixing or convergence issues, you will see something like the figure below. In this case, the KDEs will not overlap, meaning that different chains have sampled from different distributions rather than a shared posterior. The trace plot will also show structured patterns instead of random noise, indicating that chains are stuck in different regions of the parameter space and failing to fully explore it.

Figure 8. KDE (left) and trace plots (right) for parameters m, s_squared, and w across MCMC chains for the second model. Multimodal distributions are visible for m and w, suggesting potential identifiability issues. Trace plots reveal that chains explore different modes with limited mixing, particularly for m, highlighting challenges in convergence and effective sampling.

By using trace plots alongside the other diagnostics, you can identify sampling issues and determine whether your MCMC algorithm is effectively exploring the posterior distribution.

Pair Plots

A third kind of plot that is often useful for diagnostics is the pair plot. In models where we want to estimate the posterior distribution of multiple parameters, pair plots allow us to observe how different parameters are correlated. To understand how such plots are formed, think again about the bee analogy. If you imagine that we’ll create a plot with the width and length of the house, each “step” that the bees take can be represented by an (x, y) combination. Likewise, each parameter of the posterior is represented as a dimension, and we create scatter plots showing where the sampler walked using parameter values as coordinates. Here, we are plotting each unique pair (x, y), resulting in the scatter plot you see in the middle of the image below. The one-dimensional plots you see on the edges are the marginal distributions over each parameter, giving us additional information on the sampler’s behavior when exploring them.
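Assuming the idata objects from the earlier sketches, pair plots like the ones below can be produced with ArviZ:

import arviz as az

# Scatter of joint posterior samples, with marginal densities along the edges
az.plot_pair(idata_1, var_names=["m", "s"], marginals=True)
az.plot_pair(idata_2, var_names=["m", "s_squared", "w"], marginals=True)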

Take a look at the pair plot from the first model.

Figure 9. Joint posterior distribution of parameters m and s, with marginal densities. The scatter plot shows a roughly symmetric, elliptical shape, suggesting a low correlation between m and s.

Each axis represents one of the two parameters whose posteriors we are estimating. For now, let’s focus on the scatter plot in the middle, which shows the parameter combinations sampled from the posterior. The fact we have a very even distribution means that, for any particular value of m, there was a range of values of s that were equally likely to be sampled. Additionally, we don’t see any correlation between the two parameters, which is usually good! There are cases when we would expect some correlation, such as when our model involves a regression line. However, in this instance, we have no reason to believe two parameters should be highly correlated, so the fact we don’t observe unusual behavior is positive news. 

Now, take a look at the pair plots from the second model.

Figure 10. Pair plot of the joint posterior distributions for parameters m, s_squared, and w. The scatter plots reveal strong correlations between several parameters.

Given that this model has five parameters to be estimated, we naturally have a greater number of plots since we are analyzing them pair-wise. However, they look odd compared to the previous example. Namely, rather than having an even distribution of points, the samples here either seem to be divided across two regions or seem somewhat correlated. This is another way of visualizing what the rank plots have shown: the sampler did not explore the full posterior distribution. Below we isolated the top left plot, which contains the samples from m0 and m1. Unlike the plot from model 1, here we see that the value of one parameter greatly influences the value of the other. If we sampled m1 around 2.5, for example, m0 is likely to be sampled from a very narrow range around 1.5.

Figure 11. Joint posterior distribution of parameters m₀ and m₁, with marginal densities.

Certain shapes can be observed in problematic pair plots relatively frequently. Diagonal patterns, for example, indicate a high correlation between parameters. Banana shapes are often connected to parametrization issues, often being present in models with tight priors or constrained parameters. Funnel shapes might indicate hierarchical models with bad geometry. When we have two separate islands, like in the plot above, this can indicate that the posterior is bimodal AND that the chains haven’t mixed well. However, keep in mind that these shapes might indicate problems, but not necessarily do so. It’s up to the data scientist to examine the model and determine which behaviors are expected and which ones are not!

Some Fixing Techniques

When your diagnostics indicate sampling problems — whether concerning R-hat values, low ESS, unusual rank plots, separated trace plots, or strange parameter correlations in pair plots — several strategies can help you address the underlying issues. Sampling problems typically stem from the target posterior being too complex for the sampler to explore efficiently. Complex target distributions might have:

  • Multiple modes (peaks) that the sampler struggles to move between
  • Irregular shapes with narrow “corridors” connecting different regions
  • Areas of drastically different scales (like the “neck” of a funnel)
  • Heavy tails that are difficult to sample accurately

In the bee analogy, these complexities represent houses with unusual floor plans — disconnected rooms, extremely narrow hallways, or areas that change dramatically in size. Just as bees might get trapped in specific regions of such houses, MCMC chains can get stuck in certain areas of the posterior.

Figure 12. Examples of multimodal target distributions.
Figure 13. Examples of weirdly shaped distributions.

To help the sampler in its exploration, there are simple strategies we can use.

Strategy 1: Reparameterization

Reparameterization is particularly effective for hierarchical models and distributions with challenging geometries. It involves transforming your model’s parameters to make them easier to sample. Back to the bee analogy, imagine the bees are exploring a house with a peculiar layout: a spacious living room that connects to the kitchen through a very, very narrow hallway. One aspect we hadn’t mentioned before is that the bees have to fly in the same way through the entire house. That means that if we dictate the bees should use large “steps,” they will explore the living room very well but hit the walls in the hallway head-on. Likewise, if their steps are small, they will explore the narrow hallway well, but take forever to cover the entire living room. The difference in scales, which is natural to the house, makes the bees’ job more difficult.

A classic example that represents this scenario is Neal’s funnel, where the scale of one parameter depends on another:

\[
p(y, x) = \text{Normal}(y|0, 3) \times \prod_{n=1}^{9} \text{Normal}(x_n|0, e^{y/2})
\]

Figure 14. Log marginal density of y and the first dimension of Neal’s funnel. The neck is where the sampler struggles to sample, and the required step size there is much smaller than in the body. (Image source: Stan User’s Guide)

We can see that the scale of x is dependent on the value of y. To fix this problem, we can separate x and y as independent standard Normals and then transform these variables into the desired funnel distribution. Instead of sampling directly like this:

\[
\begin{align*}
y &\sim \text{Normal}(0, 3) \\
x &\sim \text{Normal}(0, e^{y/2})
\end{align*}
\]

You can reparameterize to sample from standard Normals first:

\[
y_{raw} \sim \text{Standard Normal}(0, 1) \\
x_{raw} \sim \text{Standard Normal}(0, 1) \\
\\
y = 3y_{raw} \\
x = e^{y/2} x_{raw}
\]

This technique separates the hierarchical parameters and makes sampling more efficient by eliminating the dependency between them. 
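As a rough sketch of what this non-centered version might look like in PyMC (names are ours, not the article’s):

import pymc as pm

with pm.Model() as funnel_noncentered:
    # Sample raw standard Normals...
    y_raw = pm.Normal("y_raw", mu=0, sigma=1)
    x_raw = pm.Normal("x_raw", mu=0, sigma=1, shape=9)
    # ...and transform them deterministically into the funnel's geometry
    y = pm.Deterministic("y", 3 * y_raw)
    x = pm.Deterministic("x", pm.math.exp(y / 2) * x_raw)
    idata_funnel = pm.sample(draws=2000, chains=4, random_seed=42)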

Reparameterization is like redesigning the house such that instead of forcing the bees to find a single narrow hallway, we create a new layout where all passages have similar widths. This helps the bees use a consistent flying pattern throughout their exploration.

Strategy 2: Handling Heavy-tailed Distributions

Heavy-tailed distributions like Cauchy and Student-T present challenges for samplers and the ideal step size. Their tails require larger step sizes than their central regions (similar to very long hallways that require the bees to travel long distances), which creates a challenge:

  • Small step sizes lead to inefficient sampling in the tails
  • Large step sizes cause too many rejections in the center
Figure 15. Probability density functions for various Cauchy distributions illustrate the effects of changing the location parameter and scale parameter. (Image source: Wikipedia)

Reparameterization solutions include:

  • For Cauchy: Defining the variable as a transformation of a Uniform distribution using the Cauchy inverse CDF (sketched after this list)
  • For Student-T: Using a Gamma-Mixture representation
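As a minimal sketch of the Cauchy idea (a standard inverse-CDF transform, not code from the article): a Uniform(0, 1) draw u maps to a Cauchy(loc, scale) value via loc + scale · tan(π(u − 0.5)).

import numpy as np
import pymc as pm

loc, scale = 0.0, 1.0  # illustrative location and scale

with pm.Model() as cauchy_reparam:
    u = pm.Uniform("u", lower=0, upper=1)
    # Inverse-CDF transform of the Uniform draw into a Cauchy-distributed quantity
    x = pm.Deterministic("x", loc + scale * pm.math.tan(np.pi * (u - 0.5)))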

Strategy 3: Hyperparameter Tuning

Sometimes the solution lies in adjusting the sampler’s hyperparameters (a PyMC sketch follows the list):

  • Increase total iterations: The simplest approach — give the sampler more time to explore.
  • Increase target acceptance rate (adapt_delta): Reduce divergent transitions (try 0.9 instead of the default 0.8 for complex models, for example).
  • Increase max_treedepth: Allow the sampler to take more steps per iteration.
  • Extend warmup/adaptation phase: Give the sampler more time to adapt to the posterior geometry.
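In PyMC these knobs map onto pm.sample arguments (Stan’s adapt_delta corresponds to target_accept here); a hedged example, reusing the second model sketched earlier, might be:

import pymc as pm

with model_2:
    idata_tuned = pm.sample(
        draws=4000,          # more iterations to explore the posterior
        tune=2000,           # longer warmup/adaptation phase
        target_accept=0.9,   # higher acceptance target to reduce divergences
        chains=4,
        random_seed=42,
    )
# max_treedepth can also be raised through the NUTS step settings if the
# sampler keeps hitting its tree-depth limit.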

Remember that while these adjustments may improve your diagnostic metrics, they often treat symptoms rather than underlying causes. The previous strategies (reparameterization and better proposal distributions) typically offer more fundamental solutions.

Strategy 4: Better Proposal Distributions

This strategy applies to samplers that rely on an explicit proposal distribution, such as Metropolis-Hastings, rather than to gradient-based samplers like NUTS that tune their proposals automatically. It basically asks the question: “I’m currently here in this landscape. Where should I jump to next so that I explore the full landscape, or how do I know that the next jump is the jump I should make?” Thus, choosing a good proposal distribution means making sure that the sampling process explores the full parameter space instead of just a specific region. A good proposal distribution should:

  1. Have substantial probability mass where the target distribution does.
  2. Allow the sampler to make jumps of the appropriate size.

One common choice of the proposal distribution is the Gaussian (Normal) distribution with mean μ and standard deviation σ — the scale of the distribution that we can tune to decide how far to jump from the current position to the next position. If we choose the scale for the proposal distribution to be too small, it might either take too long to explore the entire posterior or it will get stuck in a region and never explore the full distribution. But if the scale is too large, you might never get to explore some regions, jumping over them. It’s like playing ping-pong where we only reach the two edges but not the middle.
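To make the trade-off concrete, here is a toy random-walk Metropolis-Hastings sketch (ours, not the article’s) where the Gaussian proposal’s scale is the only tuning knob; try a very small and a very large scale on a bimodal target and watch how the chain behaves.

import numpy as np

def metropolis_hastings(log_target, n_samples=10_000, scale=1.0, seed=42):
    """Random-walk MH with a Normal(0, scale) proposal around the current state."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current = 0.0
    for i in range(n_samples):
        proposal = current + rng.normal(0, scale)            # Gaussian jump
        log_accept_ratio = log_target(proposal) - log_target(current)
        if np.log(rng.uniform()) < log_accept_ratio:         # accept/reject step
            current = proposal
        samples[i] = current
    return samples

# Illustrative bimodal target (log-density up to a constant): modes near -3 and +3
log_target = lambda x: np.logaddexp(-0.5 * (x + 3) ** 2, -0.5 * (x - 3) ** 2)

draws_small = metropolis_hastings(log_target, scale=0.1)  # tends to get stuck in one mode
draws_large = metropolis_hastings(log_target, scale=2.5)  # hops between modes more easily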

Improve Prior Specification

When all else fails, reconsider your model’s prior specifications. Vague or weakly informative priors (like uniformly distributed priors) can sometimes lead to sampling difficulties. More informative priors, when justified by domain knowledge, can help guide the sampler toward more reasonable regions of the parameter space. Sometimes, despite your best efforts, a model may remain challenging to sample effectively. In such cases, consider whether a simpler model might achieve similar inferential goals while being more computationally tractable. The best model is often not the most complex one, but the one that balances complexity with reliability. The table below shows the summary of fixing strategies for different issues.

Diagnostic Signal | Potential Issue | Recommended Fix
High R-hat | Poor mixing between chains | Increase iterations, adjust the step size
Low ESS | High autocorrelation | Reparameterization, increase adapt_delta
Non-uniform rank plots | Chains stuck in different regions | Better proposal distribution, start with multiple chains
Separated KDEs in trace plots | Chains exploring different distributions | Reparameterization
Funnel shapes in pair plots | Hierarchical model issues | Non-centered reparameterization
Disjoint clusters in pair plots | Multimodality with poor mixing | Adjusted distribution, simulated annealing

Conclusion

Assessing the quality of MCMC sampling is crucial for ensuring reliable inference. In this article, we explored key diagnostic metrics such as R-hat, ESS, rank plots, trace plots, and pair plots, discussing how each helps determine whether the sampler is performing properly.

If there’s one takeaway we want you to keep in mind it’s that you should always run diagnostics before drawing conclusions from your samples. No single metric provides a definitive answer — each serves as a tool that highlights potential issues rather than proving convergence. When problems arise, strategies such as reparameterization, hyperparameter tuning, and prior specification can help improve sampling efficiency.

By combining these diagnostics with thoughtful modeling decisions, you can ensure a more robust analysis, reducing the risk of misleading inferences due to poor sampling behavior.

References

B. Gilbert, Bob’s bees: the importance of using multiple bees (chains) to judge MCMC convergence (2018), Youtube

Chi-Feng, MCMC demo (n.d.), GitHub

D. Simpson, Maybe it’s time to let the old ways die; or We broke R-hat so now we have to fix it. (2019), Statistical Modeling, Causal Inference, and Social Science

M. Taboga, Markov Chain Monte Carlo (MCMC) methods (2021), Lectures on probability theory and mathematical Statistics. Kindle Direct Publishing. 

T. Wiecki, MCMC Sampling for Dummies (2024), twiecki.io

Stan User’s Guide, Reparametrization (n.d.), Stan Documentation

How to Measure Real Model Accuracy When Labels Are Noisy
https://towardsdatascience.com/how-to-measure-real-model-accuracy-when-labels-are-noisy/
The math behind “true” accuracy and error correlation

Ground truth is never perfect. From scientific measurements to human annotations used to train deep learning models, ground truth always has some amount of error. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in its human annotations. How, then, can we evaluate predictive models using such erroneous labels?

In this article, we explore how to account for errors in test data labels and estimate a model’s “true” accuracy.

Example: image classification

Let’s say there are 100 images, each containing either a cat or a dog. The images are labeled by human annotators who are known to have 96% accuracy (Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ). If we train an image classifier on some of this data and find that it has 90% accuracy on a hold-out set (Aᵐᵒᵈᵉˡ), what is the “true” accuracy of the model (Aᵗʳᵘᵉ)? A couple of observations first:

  1. Within the 90% of predictions that the model got “right,” some examples may have been incorrectly labeled, meaning both the model and the ground truth are wrong. This artificially inflates the measured accuracy.
  2. Conversely, within the 10% of “incorrect” predictions, some may actually be cases where the model is right and the ground truth label is wrong. This artificially deflates the measured accuracy.

Given these complications, how much can the true accuracy vary?

Range of true accuracy

True accuracy of model for perfectly correlated and perfectly uncorrelated errors of model and label. Figure by author.

The true accuracy of our model depends on how its errors correlate with the errors in the ground truth labels. If our model’s errors perfectly overlap with the ground truth errors (i.e., the model is wrong in exactly the same way as human labelers), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 − (1 − 0.96) = 86%

Alternatively, if our model is wrong in exactly the opposite way as human labelers (perfect negative correlation), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 + (1 − 0.96) = 94%

Or more generally:

Aᵗʳᵘᵉ = Aᵐᵒᵈᵉˡ ± (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

It’s important to note that the model’s true accuracy can be both lower and higher than its reported accuracy, depending on the correlation between model errors and ground truth errors.

Probabilistic estimate of true accuracy

In some cases, inaccuracies among labels are randomly spread among the examples and not systematically biased toward certain labels or regions of the feature space. If the model’s inaccuracies are independent of the inaccuracies in the labels, we can derive a more precise estimate of its true accuracy.

When we measure Aᵐᵒᵈᵉˡ (90%), we’re counting cases where the model’s prediction matches the ground truth label. This can happen in two scenarios:

  1. Both model and ground truth are correct. This happens with probability Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ.
  2. Both model and ground truth are wrong (in the same way). This happens with probability (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ).

Under independence, we can express this as:

Aᵐᵒᵈᵉˡ = Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ + (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

Rearranging the terms, we get:

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ + Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1) / (2 × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1)

In our example, that equals (0.90 + 0.96 − 1) / (2 × 0.96 − 1) ≈ 93.5%, which is within the range of 86% to 94% that we derived above.
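These formulas are easy to sanity-check in a few lines of Python (numbers from the running example):

a_model = 0.90        # accuracy measured against the noisy labels
a_groundtruth = 0.96  # accuracy of the human annotations

# Bounds when the error correlation is unknown
lower_bound = a_model - (1 - a_groundtruth)   # 0.86
upper_bound = a_model + (1 - a_groundtruth)   # 0.94

# Point estimate under the independence assumption
a_true = (a_model + a_groundtruth - 1) / (2 * a_groundtruth - 1)
print(round(lower_bound, 3), round(a_true, 3), round(upper_bound, 3))  # 0.86 0.935 0.94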

The independence paradox

Plugging in Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ as 0.96 from our example, we get

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ − 0.04) / 0.92. Let’s plot this below.

True accuracy as a function of model’s reported accuracy when ground truth accuracy = 96%. Figure by author.

Strange, isn’t it? If we assume that model’s errors are uncorrelated with ground truth errors, then its true accuracy Aᵗʳᵘᵉ is always higher than the 1:1 line when the reported accuracy is > 0.5. This holds true even if we vary Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ:

Model’s “true” accuracy as a function of its reported accuracy and ground truth accuracy. Figure by author.

Error correlation: why models often struggle where humans do

The independence assumption is crucial but often doesn’t hold in practice. If some images of cats are very blurry, or some small dogs look like cats, then both the ground truth and model errors are likely to be correlated. This causes Aᵗʳᵘᵉ to be closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)) than the upper bound.

More generally, model errors tend to be correlated with ground truth errors when:

  1. Both humans and models struggle with the same “difficult” examples (e.g., ambiguous images, edge cases)
  2. The model has learned the same biases present in the human labeling process
  3. Certain classes or examples are inherently ambiguous or challenging for any classifier, human or machine
  4. The labels themselves are generated from another model
  5. There are too many classes (and thus too many different ways of being wrong)

Best practices

The true accuracy of a model can differ significantly from its measured accuracy. Understanding this difference is crucial for proper model evaluation, especially in domains where obtaining perfect ground truth is impossible or prohibitively expensive.

When evaluating model performance with imperfect ground truth:

  1. Conduct targeted error analysis: Examine examples where the model disagrees with ground truth to identify potential ground truth errors.
  2. Consider the correlation between errors: If you suspect correlation between model and ground truth errors, the true accuracy is likely closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)).
  3. Obtain multiple independent annotations: Having multiple annotators can help estimate ground truth accuracy more reliably.

Conclusion

In summary, we learned that:

  1. The range of possible true accuracy depends on the error rate in the ground truth
  2. When errors are independent, the true accuracy is often higher than measured for models better than random chance
  3. In real-world scenarios, errors are rarely independent, and the true accuracy is likely closer to the lower bound

Unlock the Power of ROC Curves: Intuitive Insights for Better Model Evaluation
https://towardsdatascience.com/unlock-the-power-of-roc-curves-intuitive-insights-for-better-model-evaluation/
Go beyond the definitions: grasp the real meaning of AUC and ROC analysis for practical data science

We’ve all been in that moment, right? Staring at a chart as if it’s some ancient script, wondering how we’re supposed to make sense of it all. That’s exactly how I felt when I was asked to explain the AUC for the ROC curve at work recently.

Though I had a solid understanding of the math behind it, breaking it down into simple, digestible terms proved to be a challenge. I realized that if I was struggling with it, others probably were too. So, I decided to write this article to share an intuitive way to understand the AUC-ROC curve through a practical example. No dry definitions here—just clear, straightforward explanations focused on the intuition.

Here’s the code [1] used in this article.

Every data scientist goes through a phase of evaluating classification models. Amidst an array of evaluation metrics, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are indispensable tools for gauging a model’s performance. In this comprehensive article, we will discuss the basic concepts and see them in action using our good old Titanic dataset [2].

Section 1: ROC Curve

At its core, the ROC curve visually portrays the delicate balance between a model’s sensitivity and specificity across varying classification thresholds.

To fully grasp the ROC curve, let’s delve into the concepts:

  • Sensitivity/Recall (True Positive Rate): Sensitivity quantifies a model’s adeptness at correctly identifying positive instances. In our Titanic example, sensitivity corresponds to the proportion of actual survival cases that the model accurately labels as positive.
  • Specificity (True Negative Rate): Specificity measures a model’s proficiency in correctly identifying negative instances. For our dataset, it represents the proportion of actual non-survived cases (Survival = 0) that the model correctly identifies as negative.
  • False Positive Rate: FPR measures the proportion of negative instances that are incorrectly classified as positive by the model.

Notice that Specificity and FPR are complementary to each other. While specificity focuses on the correct classification of negative instances, FPR focuses on the incorrect classification of negative instances as positive. Thus, FPR = 1 − Specificity.

Now that we know the definitions, let’s work with an example. For the Titanic dataset, I have built a simple logistic regression model that predicts whether the passenger survived the shipwreck or not, using the following features: Passenger Class, Sex, # of siblings/spouses aboard, passenger fare and Port of Embarkation. Note that the model predicts the ‘probability of survival’. The default threshold for logistic regression in sklearn is 0.5. However, this default threshold may not always make sense for the problem being solved, and we need to play around with the probability threshold, i.e., if the predicted probability > threshold, the instance is predicted to be positive, else negative.

Now, let’s revisit the definitions of Sensitivity, Specificity and FPR above. Since our predicted binary classification is dependent on the probability threshold, for the given model, these three metrics will change based on the probability threshold we use. If we use a higher probability threshold, we will classify fewer cases as positives i.e. our true positives will be fewer, resulting in lower Sensitivity/Recall. A higher probability threshold also means fewer false positives, so low FPR. As such, increasing sensitivity/recall could lead to increased FPR.

For our training data, we will use 10 different probability cutoffs and calculate Sensitivity/TPR and FPR and plot in a chart below. Note, the size of circles in the scatterplot correspond to the probability threshold used for classification.

Chart 1: FPR vs TPR chart along with actual values in the DataFrame (image by author)

Well, that’s it. The graph we created above, which plots Sensitivity (TPR) vs. FPR at various probability thresholds, IS the ROC curve!

In our experiment, we used 10 different probability cutoffs with an increment of 0.1 giving us 10 observations. If we use a smaller increment for the probability threshold, we will end up with more data points and the graph will look like our familiar ROC curve.

To confirm our understanding, for the model we built to predict passenger survival, we will loop through various predicted probability thresholds and calculate TPR and FPR for the testing dataset (see code snippet below). We then plot the results in a graph and compare this graph with the ROC curve plotted using sklearn’s roc_curve [3] function.
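The original snippet is not reproduced here, but a minimal sketch of that loop could look like the following, where y_test and y_prob (the hold-out labels and the model’s predicted survival probabilities) are assumed to exist from the earlier model-fitting step:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# y_test: true 0/1 labels; y_prob: predicted probability of survival (assumed defined earlier)
thresholds = np.arange(0.0, 1.01, 0.01)
tpr_list, fpr_list = [], []

for t in thresholds:
    y_pred = (y_prob >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_test == 1))
    fn = np.sum((y_pred == 0) & (y_test == 1))
    fp = np.sum((y_pred == 1) & (y_test == 0))
    tn = np.sum((y_pred == 0) & (y_test == 0))
    tpr_list.append(tp / (tp + fn))   # sensitivity / recall
    fpr_list.append(fp / (fp + tn))   # 1 - specificity

# Compare against sklearn's implementation
fpr_sk, tpr_sk, _ = roc_curve(y_test, y_prob)
print("AUC:", roc_auc_score(y_test, y_prob))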

Chart 2: sklearn ROC curve on the left and manually created ROC curve on right (image by author)

As we can see, the two curves are almost identical. Note that the AUC = 0.92 was calculated using the roc_auc_score [4] function. We will discuss this AUC in the later part of this article.

To summarize, the ROC curve plots TPR and FPR for the model at various probability thresholds. Note that the actual probabilities are NOT displayed in the graph, but one can assume that the observations on the lower left side of the curve correspond to higher probability thresholds (low TPR), and observations on the top right side correspond to lower probability thresholds (high TPR).

To visualize what’s stated above, refer to the below chart, where I have tried to annotate TPR and FPR at different probability cutoffs.

Chart 3: ROC Curve with different probability cutoffs (image by author)

Section 2: AUC

Now that we have developed some intuition around what ROC curve is, the next step is to understand Area Under the Curve (AUC). But before delving into the specifics, let’s think about what a perfect classifier looks like. In the ideal case, we want the model to achieve perfect separation between positive and negative observations. In other words, the model assigns low probabilities to negative observations and high probabilities to positive observations with no overlap. Thus, there will exist some probability cut off, such that all observations with predicted probability < cut off are negative, and all observations with probability >= cut off are positive. When this happens, True Positive Rate will be 1 and False Positive Rate will be 0. So the ideal state to achieve is TPR=1 and FPR=0. In reality, this does not happen, and a more practical expectation should be to maximize TPR and minimize FPR.

In general, as TPR increases with lowering probability threshold, the FPR also increases (see chart 1). We want TPR to be much higher than FPR. This is characterized by the ROC curve that is bent towards the top left side. The following ROC space chart shows the perfect classifier with a blue circle (TPR=1 and FPR=0). Models that yield the ROC curve closer to the blue circle are better. Intuitively, it means that the model is able to fairly separate negative and positive observations. Among the ROC curves in the following chart, light blue is best followed by green and orange. The dashed diagonal line represents random guesses (think of a coin flip).

Chart 4: ROC Curve Comparison (source [5])

Now that we understand ROC curves skewed to the top left are better, how do we quantify this? Well, mathematically, this can be quantified by calculating the Area Under the Curve. The Area Under the Curve (AUC) of the ROC curve is always between 0 and 1 because our ROC space is bounded between 0 and 1 on both axes. Among the above ROC curves, the model corresponding to the light blue ROC curve is better compared to green and orange as it has higher AUC.

But how is AUC calculated? Computationally, AUC involves integrating the ROC curve. For models generating discrete predictions, AUC can be approximated using the trapezoidal rule [6]. In its simplest form, the trapezoidal rule works by approximating the region under the graph as a trapezoid and calculating its area. I’ll probably discuss this in another article.

This brings us to the last and the most awaited part — how to intuitively make sense of AUC? Let’s say you built a first version of a classification model with AUC 0.7 and you later fine tune the model. The revised model has an AUC of 0.9. We understand that the model with higher AUC is better. But what does it really mean? What does it imply about our improved prediction power? Why does it matter? Well, there’s a lot of literature explaining AUC and its interpretation. Some of them are too technical, some incomplete, and some are outright wrong! One interpretation that made the most sense to me is:

AUC is the probability that a randomly chosen positive instance possesses a higher predicted probability than a randomly chosen negative instance.

Let’s verify this interpretation. For the simple logistic regression we built, we will visualize the predicted probabilities of positive and negative classes (i.e. Survived the shipwreck or not).

Chart 5: Predicted Probabilities of Survived and Not Survived Passengers (image by author)

We can see the model performs pretty well in assigning a higher probability to Survived cases than those that did not. There’s some overlap of probabilities in the middle section. The AUC calculated using the roc_auc_score function in sklearn for our model on the test dataset is 0.92 (see Chart 2). So based on the above interpretation of AUC, if we randomly choose a positive instance and a negative instance, the probability that the positive instance will have a higher predicted probability than the negative instance should be ~92%.

For this purpose, we will create pools of predicted probabilities of positive and negative outcomes. Now we randomly select one observation each from both the pools and compare their predicted probabilities. We repeat this 100K times. Later we calculate % of times the predicted probability of a positive instance was > predicted probability of a negative instance. If our interpretation is correct, this should be equal to AUC.
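A sketch of that simulation (again assuming y_test and y_prob from the earlier step) might be:

import numpy as np

rng = np.random.default_rng(0)
pos_probs = y_prob[y_test == 1]   # predicted probabilities of actual survivors
neg_probs = y_prob[y_test == 0]   # predicted probabilities of actual non-survivors

# Randomly pair one positive and one negative observation, 100K times
pos_draws = rng.choice(pos_probs, size=100_000, replace=True)
neg_draws = rng.choice(neg_probs, size=100_000, replace=True)

print(np.mean(pos_draws > neg_draws))  # should land close to the AUC (~0.92 here)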

We did indeed get 0.92! Hope this helps.

Let me know your comments and feel free to connect with me on LinkedIn.

Note: this article is a revised version of the original article that I wrote on Medium in 2023.


References:

  1. https://github.com/Swpnilsp/ROC-AUC-Curve/blob/main/RoC_Curve_Analysis%20(2).ipynb
  2. https://www.kaggle.com/competitions/titanic/data (License-CC0: Public Domain)
  3. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
  4. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
  5. https://en.wikipedia.org/wiki/Receiver_operating_characteristic
  6. https://en.wikipedia.org/wiki/Trapezoidal_rule

Linear Programming: Managing Multiple Targets with Goal Programming
https://towardsdatascience.com/linear-programming-managing-multiple-targets-with-goal-programming/
Part 6: Balancing multiple objectives using the weights and preemptive goal programming approaches


This is the sixth (and likely last) part of a Linear Programming series I’ve been writing. With the core concepts covered by the prior articles, this article focuses on goal programming which is a less frequent linear programming (LP) use case. Goal programming is a specific linear programming setup that can handle the optimization of multiple objectives in a single LP problem.

By the end of this article, you’ll understand:
1. Definition of goal programming and when it should be used
2. The weighted goal programming approach — illustrated with an example
3. The preemptive goal programming approach — illustrated with an example

Definition and Use Case of Goal Programming

Goal programming is a special case of linear programming that allows for multiple — often conflicting — objectives to be balanced. With regular LP problems, you pick a single metric to optimize (minimize or maximize) and set constraints to ensure that the optimal solution is feasible. Goal programming is a technique that allows for multiple objective metrics to be targeted at the same time.

The ‘mindset’ of goal programming is fundamentally different from plain vanilla LP problems. Basic LP searches for ways to get as little or as much of a single metric as possible — e.g., maximize profit or minimize waste  — subject to constraints. Often conflicting priorities will be found in the objective function or the constraints. For example, maximize profit (objective) subject to a maximum amount of waste (constraint). With goal programming, we can move important constraint metrics into the objective function so we can optimize on them rather than just constraining them. We can maximize profit and minimize waste at the same time!

Now is a good time to establish the example we will explore for the rest of the article:

Let’s imagine that we manage a factory that makes clothing. Our factory can make pants, shirts, and dresses. Each article of clothing has costs, profit, and waste associated with their production. We want to create a production plan that will make a certain level of profit and also waste below a certain quantity due to an environmental commitment. Let’s say that we want to make $150k a month in profit and we also want to waste less than 20k yards of fabric. In addition to our goals, we can’t spend more than $50k in materials and labor. 

The example above sounds a lot like a regular linear programming problem, right? Well, the twist is that we can’t make $150k in profit and waste less than 20k yards of fabric at the same time. In other words, there would be no feasible solution if we were to plug this into a regular linear programming problem. Typically, the goals set in the problems cannot all be achieved with a single solution, otherwise there wouldn’t be a point in using goal programming. We would just use regular old linear programming with our goals as constraints. The real value of goal programming is that it can create a compromise between conflicting goals when regular linear programming would yield an infeasible solution.

The real value of goal programming is that it can create a compromise between conflicting goals when regular linear programming would yield an infeasible solution.

How does goal programming balance and compromise with conflicting goals? There are two popular approaches: (1) weighted and (2) preemptive. We’ll cover these in detail in the following sections.

The weights approach

Here, we’ll dive into the details of the weights approach. The weights method has a single objective function and runs a single optimization based on (you guessed it) weights! The fact that only one optimization is run under the weights method may seem like a given — but the preemptive method actually runs multiple linear programming optimizations. We’ll get to that in the next section…

The weights method has specific targets or goals for multiple metrics — e.g., make at least $150k in profit selling clothes or waste no more than 20k yards of fabric. For regular LP, we want to fully optimize. For the weights method of goal programming, we want to get as close to hitting goals as possible — after we reach a goal, the optimization doesn’t see more benefit in further maximization or minimization, so it prioritizes hitting the next most important goal. If this seems confusing now, don’t worry it will make more sense as we get into the example.

The objective function for the weights approach is specially formulated to minimize the weighted differences between a metric’s goal and the actual value of the metric. Let’s jump to our example from above — i.e., we want to make $150k in profit and waste less than 20k yards of raw material. Our objective is to minimize how far off we are from both of these goals. 

Here is the mathematical formulation for this objective:

Minimize: w1·P + w2·W

Where w1 and w2 are assigned weights and P and W are how far we miss our profit and waste goals respectively

With our objective function set up, we need to define our constraints. We will have two types of constraints (1) goal related constraints and (2) regular linear programming constraints (the same kind of constraints you would find in plain vanilla LP). Let’s talk about the goal related constraints first.

We will need to create two things to set up our goal related constraints, (1) profit and waste functions and (2) multiple slack variables. Let’s go through these one at a time.

The profit and waste functions are very straight forward. They combine our decision variables together to calculate total profit and total waste give a specific solution. Below are the formulas that tie profit and waste to the number of pants, shirts and dresses we produce:

profit and waste functions

With our profit and waste functions established, let’s start talking about our slack variables. In goal programming, slack variables are used to measure how far a solution is from hitting a goal. In our example, the variables P and W are both slack variables — they represent how much lower our profit is compared to our profit goal and how much higher our waste is compared to our waste goal. Slack variables are embedded in constraints. Below are the constraint functions for our profit and waste goals — again, the P’s and the W’s are our slack variables:

\[ \text{profit} + P^- - P^+ = 150{,}000 \]

\[ \text{waste} + W^- - W^+ = 20{,}000 \]

P+, P-, W+ and W- are slack variables; profit and waste are the functions established with the formulas above.

Note that we have plus and minus slack variables — this allows us to miss the goal on either end. We only want to penalize the slack variable that goes in the opposite direction of our goal (e.g., we don’t want to penalize more profit than our goal, we only want to penalize less profit) — that is why only one slack variable per goal is in the objective function. With this new notation, let’s rewrite our objective function:

\[ \min \; w_1 P^- + w_2 W^+ \]

We have now done all of the special work for goal programming. The last thing we need to do is quickly add our plain vanilla budget constraint. We are using a regular constraint for our budget because, in our example, it is a hard constraint. Unlike with profit and waste, we cannot violate the budget.

\[ \text{cost} = 5 \cdot \text{pants} + 1 \cdot \text{shirts} + 10 \cdot \text{dresses} \le 50{,}000 \]

Now, we have a fully specified goal programming problem. Let’s set it up in Python and solve!

# $150,000 in profit
problem += profit + profit_deviation_neg - profit_deviation_pos == 150000, "Profit_Goal"

# Waste goal: No more than 20,000 yards of waste
problem += waste + waste_deviation_neg - waste_deviation_pos == 20000, "Waste_Goal"

# Budget constraint
problem += cost <= 50000

# Solve the problem
problem.solve()

# Display the results
print("Status:", pulp.LpStatus[problem.status])
print("Pants produced:", pulp.value(pants))
print("Shirts produced:", pulp.value(shirts))
print("Dresses produced:", pulp.value(dresses))
print("Cost :", pulp.value(cost))
print("Profit :", pulp.value(profit))
print("Profit deviation (positive):", pulp.value(profit_deviation_pos))
print("Profit deviation (negative):", pulp.value(profit_deviation_neg))
print("Waste :", pulp.value(waste))
print("Waste deviation (positive):", pulp.value(waste_deviation_pos))
print("Waste deviation (negative):", pulp.value(waste_deviation_neg))

This optimization recommends we make 0 pants, 50,000 shirts and 0 dresses. We make $150k in profit which matches our goal exactly and we waste 50k yards of fabric which exceeds our max waste by 30k yards. The full results are printed by the code and shown below:

results of optimization run with equal weights

Now that we’ve got the basic structure of the weights approach down, let’s actually talk about the weights! In our first example, we gave equal weights to a dollar of missed profit and to a yard of excess waste. This probably doesn’t make a lot of sense because they are different units. Setting the weights is a subjective decision to be made by the practitioner. For our example, we’ll decide that wasting a yard of fabric is 1.5 times as bad as missing a dollar of profit. In other words, we’ll increase the weight of fabric waste to 1.5 in our objective function.

problem += profit_deviation_neg + 1.5*waste_deviation_pos

The optimization with the updated weights recommends we make about 8,572 pants, 7,143 shirts and 0 dresses. With this solution, we generate $107k in profit (which is a goal miss of $43k) and we waste 20,000 yards of fabric which matches our goal exactly. The full results are printed by the code and shown below:

Results of optimization run with 1.5 weight on fabric waste

Clearly, shifting the weights of the goals can have a large impact on the optimization results. We need to carefully set our weights to adequately balance the relative importance of our goals!

Now that we have a solid understanding of how the weighted approach works, let’s shift to goal programming with the preemptive approach.

Preemptive Approach

While the weights method balances goals using weights in the objective function, the preemptive method gives hierarchical priority to goals through iterative optimizations. That’s a lot of words, don’t worry, we’ll break it down!

Here are the steps to the preemptive approach:

1. Run a regular linear programming optimization on your first goal — e.g., maximize profit
2. Save the objective value from that run
3. Run another regular linear programming on the next most important goal — e.g., minimize waste — but, add the objective value from the last run as a constraint
4. Repeat the process until you have gone through all goal metrics

Two important features of the preemptive method are (1) it prioritizes goals by rank and (2) the objective value of a higher importance goal cannot be decreased (because of the hard constraint) when optimizing lower priority goals. Let’s go over an example to build intuition.

For our example, let’s say that profit is the most important goal and minimizing waste is the second. We’ll start out by running a plain vanilla optimization that maximizes profit:

# Define the problem
problem = pulp.LpProblem("Clothing_Company_Goal_Programming", pulp.LpMaximize)

# Decision variables: number of pants, shirts, and dresses produced
pants = pulp.LpVariable('pants', lowBound=0, cat='Continuous')
shirts = pulp.LpVariable('shirts', lowBound=0, cat='Continuous')
dresses = pulp.LpVariable('dresses', lowBound=0, cat='Continuous')

# Profit and cost coefficients
profit = 10 * pants + 3 * shirts + 15 * dresses
cost = 5 * pants + 1 * shirts + 10 * dresses
waste = 1.5 * pants + 1 * shirts + 3 * dresses

# Objective function: Maximize profit
problem += profit

# Constraints
# Budget constraint
problem += cost <= 50000

# Solve the problem
problem.solve()

# Display the results
print("Status:", pulp.LpStatus[problem.status])
print("Pants produced:", pulp.value(pants))
print("Shirts produced:", pulp.value(shirts))
print("Dresses produced:", pulp.value(dresses))
print("Cost :", pulp.value(cost))
print("Profit :", pulp.value(profit))

The results of the profit maximizing LP problem are below:

Profit maximization

So, our objective function says to make 50k shirts and collect a profit of $150k. This was only the first optimization we are going to run though! Following the algorithm outlined above, we will now run another LP that minimizes waste, but we will add a constraint of profit ≥ $150k to ensure we don’t get a worse profit.

# Define the problem
problem = pulp.LpProblem("Clothing_Company_Goal_Programming", pulp.LpMinimize)

# Decision variables: number of pants, shirts, and dresses produced
pants = pulp.LpVariable('pants', lowBound=0, cat='Continuous')
shirts = pulp.LpVariable('shirts', lowBound=0, cat='Continuous')
dresses = pulp.LpVariable('dresses', lowBound=0, cat='Continuous')

# Profit and cost coefficients
profit = 10 * pants + 3 * shirts + 15 * dresses
cost = 5 * pants + 1 * shirts + 10 * dresses
waste = 1.5 * pants + 1 * shirts + 3 * dresses

# Objective function: Minimize the fabric waste
problem += waste

# Budget constraint
problem += cost <= 50000

problem += profit >= 150000

# Solve the problem
problem.solve()

# Display the results
print("Status:", pulp.LpStatus[problem.status])
print("Pants produced:", pulp.value(pants))
print("Shirts produced:", pulp.value(shirts))
print("Dresses produced:", pulp.value(dresses))
print("Cost :", pulp.value(cost))
print("Profit :", pulp.value(profit))

And here are the results of this final optimization:

results of waste minimizing optimization

The astute observer will notice that this optimization returns the exact same solution 😅. This can often be the case with the preemptive method. The constraint of previously optimized goals can be very restrictive. The iterative optimizations only yield different solutions when there are multiple ways to achieve the optimal values of the previous goals. For example, if there were two ways to get $150k in profit, one with more waste and the other with less, our second iteration would return the solution with the lower waste. In the preemptive method, there is no trade-off between the goals. Even if there were a solution that made $149k in profit with 0 yards of waste, the preemptive method would always choose $150k in profit with 50k yards of waste. The extra $1000 of profit is infinitely more important than any amount of wasted fabric.

The preemptive method should be used when goals are clearly prioritized and there is no substitution between them — meaning no amount of success in lower priority goals can compensate for reduced optimization in higher priority ones. When used correctly, the preemptive method can very effectively optimize a primary goal while still searching for good solutions to lower priority goals.

Conclusion

Goal programming provides a framework that extends traditional linear programming to optimize multiple metrics at the same time. The weighted approach balances priorities via weights in the objective function and is appropriate when objective metrics have relative importance that can be quantified. The preemptive method is an iterative approach that gives priority to metrics based on hierarchy. It is appropriate when some objectives are wholly more important than others. Both approaches can be powerful optimization techniques when applied in the correct context!

Happy optimizing!

The post Linear Programming: Managing Multiple Targets with Goal Programming appeared first on Towards Data Science.

]]>
Uncertainty Quantification in Machine Learning with an Easy Python Interface https://towardsdatascience.com/uncertainty-quantification-in-machine-learning-with-an-easy-python-interface/ Wed, 26 Mar 2025 19:14:47 +0000 https://towardsdatascience.com/?p=605304 The ML Uncertainty Package

The post Uncertainty Quantification in Machine Learning with an Easy Python Interface appeared first on Towards Data Science.

]]>
Uncertainty quantification (UQ) in a Machine Learning (ML) model allows one to estimate the precision of its predictions. This is extremely important for utilizing its predictions in real-world tasks. For instance, if a machine learning model is trained to predict a property of a material, a predicted value with a 20% uncertainty (error) is likely to be used very differently from a predicted value with a 5% uncertainty (error) in the overall decision-making process. Despite its importance, UQ capabilities aren’t available with popular ML software in Python, such as scikit-learn, Tensorflow, and Pytorch.

Enter ML Uncertainty: a Python package designed to address this problem. Built on top of popular Python libraries such as SciPy and scikit-learn, ML Uncertainty provides a very intuitive interface to estimate uncertainties in ML predictions and, where possible, model parameters. Requiring only about four lines of code to perform these estimations, the package leverages powerful and theoretically rigorous mathematical methods in the background. It exploits the underlying statistical properties of the ML model in question, making the package computationally inexpensive. Moreover, this approach extends its applicability to real-world use cases where often, only small amounts of data are available.

Motivation

I have been an avid Python user for the last 10 years. I love the large number of powerful libraries that have been created and maintained, and the community, which is very active. The idea for ML Uncertainty came to me when I was working on a hybrid ML problem. I had built an ML model to predict stress-strain curves of some polymers. Stress-strain curves–an important property of polymers–obey certain physics-based rules; for instance, they have a linear region at low strain values, and the tensile modulus decreases with temperature.

I found in the literature some non-linear models to describe the curves and these behaviors, thereby reducing the stress-strain curves to a set of parameters, each with some physical meaning. Then, I trained an ML model to predict these parameters from some easily measurable polymer attributes. Notably, I only had a few hundred data points, as is quite common in scientific applications. Having trained the model, fine-tuned the hyperparameters, and performed the outlier analysis, one of the stakeholders asked me: “This is all good, but what are the error estimates on your predictions?” And I realized that there wasn’t an elegant way to estimate this with Python. I also realized that this wasn’t going to be the last time that this problem was going to arise. And that led me down the path that culminated in this package.

Having spent some time studying Statistics, I suspected that the math for this wasn’t impossible or even that hard. I began researching and reading up books like Introduction to Statistical Learning and Elements of Statistical Learning [1, 2] and found some answers there. ML Uncertainty is my attempt at implementing some of those methods in Python to integrate statistics more tightly into machine learning. I believe that the future of machine learning depends on our ability to increase the reliability of predictions and the interpretability of models, and this is a small step towards that goal. Having developed this package, I have frequently used it in my work, and it has benefited me greatly.

This is an introduction to ML Uncertainty with an overview of the theories underpinning it. I have included some equations to explain the theory, but if those are overwhelming, feel free to gloss over them. For every equation, I have stated the key idea it represents.

Getting started: An example

We often learn best by doing. So, before diving deeper, let’s consider an example. Say we are working on a good old-fashioned linear regression problem where the model is trained with scikit-learn. We think that the model has been trained well, but we want more information. For instance, what are the prediction intervals for the outputs? With ML Uncertainty, this can be done in 4 lines as shown below and discussed in this example.

Illustrating ML uncertainty code (a) and plot (b) for linear regression. Image by author.

All examples for this package can be found here: https://github.com/architdatar/ml_uncertainty/tree/main/examples.

Delving deeper: A peek under the hood

ML Uncertainty performs these computations by having the ParametricModelInference class wrap around the LinearRegression estimator from scikit-learn to extract all the information it needs to perform the uncertainty calculations. It follows the standard procedure for uncertainty estimation, which is detailed in many a statistics textbook [2], of which an overview is shown below.

Since this is a linear model that can be expressed in terms of parameters (\( \beta \)) as \( y = X\beta \), ML Uncertainty first computes the degrees of freedom for the model (\( p \)), the error degrees of freedom (\( n – p – 1 \)), and the estimated residual variance (\( \hat{\sigma}^2 \)). Then, it computes the uncertainty in the model parameters; i.e., the variance-covariance matrix [3].

\( \text{Var}(\hat{\beta}) = \hat{\sigma}^2 (J^T J)^{-1} \)

Where \( J \) is the Jacobian matrix for the parameters. For linear regression, this translates to:

\( \text{Var}(\hat{\beta}) = \hat{\sigma}^2 (X^T X)^{-1} \)

Finally, the get_intervals function computes the prediction intervals by propagating the uncertainties in both inputs as well as the parameters. Thus, for data \( X^* \) where predictions and uncertainties are to be estimated, predictions \( \hat{y^*} \) along with the \( (1 – \alpha) \times 100\% \) prediction interval are:

\( \hat{y^*} \pm t_{1 – \alpha/2, n – p – 1} \, \hat{\sigma} \sqrt{\text{Var}(\hat{y^*})} \)

Where,

\( \text{Var}(\hat{y^*}) = (\nabla_X f)(\delta X^*)^2(\nabla_X f)^T + (\nabla_\beta f)(\delta \hat{\beta})^2(\nabla_\beta f)^T + \hat{\sigma}^2 \)

In English, this means that the uncertainty in the output depends on the uncertainty in the inputs, uncertainty in the parameters, and the residual uncertainty. Simplified for a multiple linear model and assuming no uncertainty in inputs, this translates to:

\( \text{Var}(\hat{y^*}) = \hat{\sigma}^2 \left(1 + X^* (X^T X)^{-1} X^{*T} \right) \)
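To make these formulas concrete, here is a minimal NumPy/SciPy sketch that reproduces the same math for a toy ordinary linear regression. It illustrates the statistics described above rather than the ML Uncertainty API, and the data-generating process is made up purely for the example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data: y = 2 + 3x + noise (illustrative only)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
y = 2 + 3 * x + rng.normal(0, 1.5, n)

# OLS fit
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
p = X.shape[1] - 1                            # model degrees of freedom (excluding intercept)
dof = n - p - 1                               # error degrees of freedom
sigma2_hat = residuals @ residuals / dof      # estimated residual variance

# Variance-covariance matrix of the parameters: sigma^2 (X^T X)^-1
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
print("Std. errors of coefficients:", np.sqrt(np.diag(cov_beta)))

# Prediction interval for a new point x*, assuming no uncertainty in the inputs
x_star = np.array([1.0, 5.0])
y_star = x_star @ beta_hat
var_y_star = sigma2_hat * (1 + x_star @ np.linalg.inv(X.T @ X) @ x_star)
t_crit = stats.t.ppf(0.975, dof)              # 95% interval
lower = y_star - t_crit * np.sqrt(var_y_star)
upper = y_star + t_crit * np.sqrt(var_y_star)
print(f"Prediction: {y_star:.2f}, 95% PI: [{lower:.2f}, {upper:.2f}]")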

Extensions to linear regression

So, this is what goes on under the hood when those four lines of code are executed for linear regression. But this isn’t all. ML Uncertainty comes equipped with two more powerful capabilities:

  1. Regularization: ML Uncertainty supports L1, L2, and L1+L2 regularization. Combined with linear regression, this means that it can cater to LASSO, ridge, and elastic net regressions. Check out this example.
  2. Weighted least squares regression: Sometimes, not all observations are equal. We might want to give more weight to some observations and less weight to others. Commonly, this happens in science when some observations have a high amount of uncertainty while some are more precise. We want our regression to reflect the more precise ones, but cannot fully discard the ones with high uncertainty. For such cases, the weighted least squares regression is used.

Most importantly, a key assumption of linear regression is something known as homoscedasticity; i.e., that the samples of the response variables are drawn from populations with similar variances. If this is not the case, it is handled by assigning weights to responses depending on the inverse of their variance. This can be easily handled in ML Uncertainty by simply specifying the sample weights to be used during training in the y_train_weights parameter of the ParametricModelInference class, and the rest will be handled. An application of this is shown in this example, albeit for a nonlinear regression case.
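For reference, if \( w_i \) denotes the weight of observation i (typically the inverse of its variance) and \( W = \text{diag}(w_1, \dots, w_n) \), the weighted least-squares estimate takes the standard textbook closed form:

\[ \hat{\beta}_{WLS} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n} w_i \left( y_i - x_i^T \beta \right)^2 = (X^T W X)^{-1} X^T W Y \]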

Basis expansions

I am always fascinated by how much ML we can get done by just doing linear regression properly. Many kinds of data such as trends, time series, audio, and images, can be represented by basis expansions. These representations behave like linear models with many amazing properties. ML Uncertainty can be used to compute uncertainties for these models easily. Check out these examples called spline_synthetic_data, spline_wage_data, and fourier_basis.

Results of ML Uncertainty used for weighted least squares regression, B-Spline basis with synthetic data, B-Spline basis with wage data, and Fourier basis. Image by author.

Beyond linear regression

We often encounter situations where the underlying model cannot be expressed as a linear model. This commonly occurs in science, for instance, when complex reaction kinetics, transport phenomena, or process control problems are modeled. Standard Python packages like scikit-learn don’t allow one to directly fit these non-linear models and perform uncertainty estimation on them. ML Uncertainty ships with a class called NonLinearRegression capable of handling non-linear models. The user can specify the model to be fit and the class handles fitting with a scikit-learn-like interface which uses SciPy’s least_squares function in the background. This can be easily integrated with the ParametricModelInference class for seamless uncertainty estimation. Like linear regression, we can handle weighted least squares and regularization for non-linear regression. Here is an example.

Random Forests

Random Forests have gained significant popularity in the field. They operate by averaging the predictions of decision trees. Decision trees, in turn, identify a set of rules to divide the predictor variable space (input space) and assign a response value to each terminal node (leaf). The predictions from decision trees are averaged to provide a prediction for the random forest [1]. They are particularly useful because they can identify complex relationships in data, are accurate, and make fewer assumptions about the data than regressions do.

While random forests are implemented in popular ML libraries like scikit-learn, there is no straightforward way to estimate prediction intervals. This is particularly important for regression, as random forests, given their high flexibility, tend to overfit their training data. Since random forests don’t have parameters like traditional regression models do, uncertainty quantification needs to be performed differently.

We use the basic idea of estimating prediction intervals using bootstrapping as described by Hastie et al. in Chapter 7 of their book Elements of Statistical Learning [2]. The central idea we can exploit is that the variance of the predictions \( S(Z) \) for some data \( Z \) can be estimated via predictions of its bootstrap samples as follows:

\( \widehat{\text{Var}}[S(Z)] = \frac{1}{B – 1} \sum_{b=1}^{B} \left( S(Z^{*b}) – \bar{S}^{*} \right)^2 \)

Where \( \bar{S}^{*} = \sum_b S(Z^{*b}) / B \). Bootstrap samples are samples drawn from the original dataset repeatedly and independently, thereby allowing repetitions. Lucky for us, random forests are trained using one bootstrap sample for each decision tree within it. So, the prediction from each tree results in a distribution whose variance gives us the variance of the prediction. But there is still one problem. Let’s say we want to obtain the variance in prediction for the \( i^{\text{th}} \) training sample. If we simply use the formula above, some predictions will be from trees that include the \( i^{\text{th}} \) sample in the bootstrap sample on which they are trained. This could lead to an unrealistically smaller variance estimate.

To solve this problem, the algorithm implemented in ML Uncertainty only considers predictions from trees which did not use the \( i^{\text{th}} \) sample for training. This results in an unbiased estimate of the variance.
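The sketch below illustrates this bookkeeping from scratch with a hand-rolled bootstrap ensemble of decision trees, so that the bootstrap indices are explicit. It mimics the idea described above; it is not the EnsembleModelInference implementation, and the toy data and hyperparameters are arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Toy data (illustrative only)
n, B = 200, 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, n)

trees, boot_indices = [], []
for b in range(B):
    idx = rng.integers(0, n, n)               # bootstrap sample drawn with replacement
    tree = DecisionTreeRegressor(max_depth=5, random_state=b)
    tree.fit(X[idx], y[idx])
    trees.append(tree)
    boot_indices.append(set(idx.tolist()))

# Per-sample variance using only trees that did NOT see sample i during training
variances = np.empty(n)
for i in range(n):
    preds = [tree.predict(X[i:i + 1])[0]
             for tree, seen in zip(trees, boot_indices) if i not in seen]
    variances[i] = np.var(preds, ddof=1)      # bootstrap variance estimate

print("Mean out-of-bag prediction variance:", variances.mean())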

The beautiful thing about this approach is that we don’t need any additional re-training steps. Instead, the EnsembleModelInference class elegantly wraps around the RandomForestRegressor estimator in scikit-learn and obtains all the necessary information from it.

This method is benchmarked using the method described in Zhang et al. [4], which states that a correct \( (1 – \alpha) \times 100\% \) prediction interval is one for which the probability of it containing the observed response is \( (1 – \alpha) \times 100\% \). Mathematically,

\( P(Y \in I_{\alpha}) \approx 1 – \alpha \)

Here is an example to see ML Uncertainty in action for random forest models.

Uncertainty propagation (Error propagation)

How much does a certain amount of uncertainty in input variables and/or model parameters affect the uncertainty in the response variable? How does this uncertainty (epistemic) compare to the inherent uncertainty in the response variables (aleatoric uncertainty)? Often, it is important to answer these questions to decide on the course of action. For instance, if one finds that the uncertainty in model parameters contributes highly to the uncertainty in predictions, one could collect more data or investigate alternative models to reduce this uncertainty. Conversely, if the epistemic uncertainty is smaller than the aleatoric uncertainty, trying to reduce it further might be pointless. With ML uncertainty, these questions can be answered easily.

Given a model relating the predictor variables to the response variable, the ErrorPropagation class can easily compute the uncertainty in responses. Say the responses (\( y \)) are related to the predictor variables (\( X \)) via some function (\( f \)) and some parameters (\( \beta \)), expressed as:

\( y = f(X, \beta) \).

We wish to obtain prediction intervals for responses (\( \hat{y^*} \)) for some predictor data (\( X^* \)) with model parameters estimated as \( \hat{\beta} \). The uncertainty in \( X^* \) and \( \hat{\beta} \) are given by \( \delta X^* \) and \( \delta \hat{\beta} \), respectively. Then, the \( (1 – \alpha) \times 100\% \) prediction interval of the response variables will be given as:

\( \hat{y^*} \pm t_{1 – \alpha/2, n – p – 1} \, \hat{\sigma} \sqrt{\text{Var}(\hat{y^*})} \)

Where,

\( \text{Var}(\hat{y^*}) = (\nabla_X f)(\delta X^*)^2(\nabla_X f)^T + (\nabla_\beta f)(\delta \hat{\beta})^2(\nabla_\beta f)^T + \hat{\sigma}^2 \)

The important thing here is to notice how the uncertainty in predictions includes contributions from the inputs, parameters, as well as the inherent uncertainty of the response.

The ability of the ML Uncertainty package to propagate both input and parameter uncertainties makes it very handy, particularly in science, where we strongly care about the error (uncertainty) in each value being predicted. Consider the often talked about concept of hybrid machine learning. Here, we model known relationships in data through first principles and unknown ones using black-box models. Using ML Uncertainty, the uncertainties obtained from these different methods can be easily propagated through the computation graph.

A very simple example is that of the Arrhenius model for predicting reaction rate constants. The formula \( k = Ae^{-E_a / RT} \) is very well-known. Say, the parameters \( A, E_a \) were predicted from some ML model and have an uncertainty of 5%. We wish to know how much error that translates to in the reaction rate constant.

This can be very easily accomplished with ML Uncertainty as shown in this example.
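As a back-of-the-envelope illustration of what such a propagation does, here is a small NumPy sketch that pushes 5% uncertainties in A and Ea through the Arrhenius equation with a first-order (Jacobian-based) expansion. The numerical values of A, Ea, and T are invented for the example, and this shows the underlying math rather than the package’s interface.

import numpy as np

R = 8.314          # gas constant, J / (mol K)
T = 350.0          # temperature in K (illustrative)
A = 1.0e7          # pre-exponential factor (illustrative)
Ea = 8.0e4         # activation energy in J/mol (illustrative)

k = A * np.exp(-Ea / (R * T))

# 5% relative uncertainties in the parameters
dA, dEa = 0.05 * A, 0.05 * Ea

# First-order propagation: Var(k) = (dk/dA)^2 Var(A) + (dk/dEa)^2 Var(Ea)
dk_dA = np.exp(-Ea / (R * T))
dk_dEa = -A / (R * T) * np.exp(-Ea / (R * T))
dk = np.sqrt((dk_dA * dA) ** 2 + (dk_dEa * dEa) ** 2)

print(f"k = {k:.3e}, uncertainty = {dk:.3e} ({100 * dk / k:.1f}% relative)")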

Illustration of uncertainty propagation through computational graph. Image by author.

Limitations

As of v0.1.1, ML Uncertainty only works for ML models trained with scikit-learn. It supports the following ML models natively: random forest, linear regression, LASSO regression, ridge regression, elastic net, and regression splines. For any other models, the user can create the model, the residual, loss function, etc., as shown for the non-linear regression example. The package has not been tested for neural networks, transformers, and other deep learning models.

Contributions from the open-source ML community are welcome and highly appreciated. While there is much to be done, some key areas of effort are adapting ML Uncertainty to other frameworks such as PyTorch and Tensorflow, adding support for other ML models, highlighting issues, and improving documentation.

Benchmarking

The ML Uncertainty code has been benchmarked against the statsmodels package in Python. Specific cases can be found here.

Background

Uncertainty quantification in machine learning has been studied in the ML community and there is growing interest in this field. However, as of now, the existing solutions are applicable to very specific use cases and have key limitations.

For linear models, the statsmodels library can provide UQ capabilities. While theoretically rigorous, it cannot handle non-linear models. Moreover, the model needs to be expressed in a format specific to the package. This means that the user cannot take advantage of the powerful preprocessing, training, visualization, and other capabilities provided by ML packages like scikit-learn. While it can provide confidence intervals based on uncertainty in the model parameters, it cannot propagate uncertainty in predictor variables (input variables).

Another family of solutions is model-agnostic UQ. These solutions utilize subsamples of training data, train the model repeatedly based on it, and use these results to estimate prediction intervals. While sometimes useful in the limit of large data, these techniques may not provide accurate estimates for small training datasets where the samples chosen might lead to substantially different estimates. Moreover, it is a computationally expensive exercise since the model needs to be retrained multiple times. Some packages using this approach are MAPIE, PUNCC, UQPy, and ml_uncertainty by NIST (same name, different package), among many others [5–8].

With ML Uncertainty, the goals have been to keep the training of the model and its UQ separate, cater to more generic models beyond linear regression, exploit the underlying statistics of the models, and avoid retraining the model multiple times to make it computationally inexpensive.

Summary and future work

This was an introduction to ML Uncertainty—a Python software package to easily compute uncertainties in machine learning. The main features of this package have been introduced here and some of the philosophy of its development has been discussed. More detailed documentation and theory can be found in the docs. While this is only a start, there is immense scope to expand this. Questions, discussions, and contributions are always welcome. The code can be found on GitHub and the package can be installed from PyPi. Give it a try with pip install ml-uncertainty.

References

(1) James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer US: New York, NY, 2021. https://doi.org/10.1007/978-1-0716-1418-1.

(2) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer New York: New York, NY, 2009. https://doi.org/10.1007/978-0-387-84858-7.

(3) Börlin, N. Nonlinear Optimization. https://www8.cs.umu.se/kurser/5DA001/HT07/lectures/lsq-handouts.pdf.

(4) Zhang, H.; Zimmerman, J.; Nettleton, D.; Nordman, D. J. Random Forest Prediction Intervals. Am Stat 2020, 74 (4), 392–406. https://doi.org/10.1080/00031305.2019.1585288.

(5) Cordier, T.; Blot, V.; Lacombe, L.; Morzadec, T.; Capitaine, A.; Brunel, N. Flexible and Systematic Uncertainty Estimation with Conformal Prediction via the MAPIE Library. In Conformal and Probabilistic Prediction with Applications; 2023.

(6) Mendil, M.; Mossina, L.; Vigouroux, D. PUNCC: A Python Library for Predictive Uncertainty and Conformalization. In Proceedings of the Twelfth Symposium on Conformal and Probabilistic Prediction with Applications; Papadopoulos, H., Nguyen, K. A., Boström, H., Carlsson, L., Eds.; Proceedings of Machine Learning Research; PMLR, 2023; Vol. 204, pp 582–601.

(7) Tsapetis, D.; Shields, M. D.; Giovanis, D. G.; Olivier, A.; Novak, L.; Chakroborty, P.; Sharma, H.; Chauhan, M.; Kontolati, K.; Vandanapu, L.; Loukrezis, D.; Gardner, M. UQpy v4.1: Uncertainty Quantification with Python. SoftwareX 2023, 24, 101561. https://doi.org/10.1016/j.softx.2023.101561.

(8) Sheen, D. Machine Learning Uncertainty Estimation Toolbox. https://github.com/usnistgov/ml_uncertainty_py.


The post Uncertainty Quantification in Machine Learning with an Easy Python Interface appeared first on Towards Data Science.

]]>
Least Squares: Where Convenience Meets Optimality https://towardsdatascience.com/least-squares-where-convenience-meets-optimality/ Tue, 25 Mar 2025 07:50:19 +0000 https://towardsdatascience.com/?p=605225 Beyond being computationally easy, Least Squares is statically optimal and has a deep connection with Maximum Likelihood

The post Least Squares: Where Convenience Meets Optimality appeared first on Towards Data Science.

]]>
0. Introduction

Least Squares is used almost everywhere when it comes to numerical optimization and regression tasks in machine learning. It aims at minimizing the Mean Squared Error (MSE) of a given model.

Both L1 (sum of absolute values) and L2 (sum of squares) norms offer an intuitive way to sum signed errors while preventing them from cancelling each other out. Yet the L2 norm results in a much smoother loss function and avoids the kinks of the absolute values.

But why is such a simple loss function so popular? We’ll see that there are pretty solid arguments in favor of the Least Squares, beyond being easy to compute.

  1. Computational Convenience: The square loss function is easy to differentiate and provides a closed-form solution when optimizing a Linear Regression.
  2. Mean and Median: We’re all familiar with these two quantities, but amusingly not many people know that they naturally stem from L2 and L1 losses.
  3. OLS is BLUE: Among all linear unbiased estimators, Ordinary Least-Squares (OLS) is the Best Linear Unbiased Estimator (BLUE), i.e. the one with lowest variance.
  4. LS is MLE with normal errors: Using Least-Squares to fit any model, linear or not, is equivalent to Maximum Likelihood Estimation under normally distributed errors.

In conclusion, the Least Squares approach totally makes sense from a mathematical perspective. However, bear in mind that it might become unreliable if the theoretical assumptions are no longer fulfilled, e.g. when the data distribution contains outliers.

N.B. I know there’s already a great Reddit thread, “Why Do We Use Least Squares In Linear Regression?”, about this topic. However, I’d like to focus in this article on presenting both intuitive understanding and rigorous proofs.


Photo by Pablo Arroyo on Unsplash

1. Computational Convenience

Optimization

Training a model means tweaking its parameters to optimize a given cost function. In some very fortunate cases, differentiating it allows us to directly derive a closed-form solution for the optimal parameters, without having to go through an iterative optimization.

Precisely, the square function is convex, smooth, and easy to differentiate. In contrast, the absolute function is non-differentiable at 0, making the optimization process less straightforward.

Differentiability

When training a regression model with n input-output pairs (x,y) and a model f parametrized by θ, the Least-Squares loss function is:

\[ L(\theta) = \sum_{i=1}^{n} \left( y_i - f(x_i; \theta) \right)^2 \]

As long as the model f is differentiable with respect to θ, we can easily derive the gradient of the loss function.

Linear Regression

Linear Regression estimates the optimal linear coefficients β given a dataset of n input-output pairs (x,y).

The equations below show on the left the L1 loss and on the right the L2 loss used to evaluate the fitness of β on the dataset:

\[ L_1(\beta) = \sum_{i=1}^{n} \left| y_i - x_i^T \beta \right| \qquad\qquad L_2(\beta) = \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2 \]

We usually drop the index i and switch to a vectorized notation to better leverage linear algebra. This can be done by stacking the input vectors as rows to form the design matrix X. Similarly, the outputs are stacked into a vector Y.

Ordinary Least-Squares

The L1 formulation offers very little room for improvement. On the other hand, the L2 formulation is differentiable and its gradient becomes zero only for a single optimal set of parameters β. This approach is known as Ordinary Least-Squares (OLS).

Zeroing the gradient yields the closed form solution of the OLS estimator, using the pseudo-inverse matrix. This means we can directly compute the optimal coefficients without the need for a numerical optimization process.
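In symbols, the OLS estimator is \( \hat{\beta} = (X^T X)^{-1} X^T Y \). A quick NumPy sketch with made-up data confirms that this closed form matches what a generic numerical least-squares solver returns:

import numpy as np

rng = np.random.default_rng(1)

# Made-up data: y = 1 + 2*x1 - 3*x2 + noise
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(0, 0.5, n)

# Closed-form OLS via the (pseudo-)inverse
beta_closed = np.linalg.pinv(X.T @ X) @ X.T @ y

# Numerical least-squares solver for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_closed)
print(beta_lstsq)    # both give (approximately) the same coefficients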

Remarks

Modern computers are really efficient and the performance drop between analytical and numerical solutions is usually not that significant. Thus, computational convenience is not the main reason why we actually use Least-Squares.


Photo by Chris Lawton on Unsplash

2. Mean and Median

Introduction

You’ve certainly already computed a mean or median, whether with Excel, NumPy, or by hand. They are key concepts in Statistics, and often provide valuable insights for income, grades, test scores or age distributions.

We’re so familiar with these two quantities that we rarely question their origin. Yet, amusingly, they stem naturally from L2 and L1 losses.

Given a set of real values xi, we often try to aggregate them into a single good representative value, e.g. the mean or median. That way, we can more easily compare different sets of values. However, what represents “well” the data is purely subjective and depends on our expectations, i.e. the cost function. For instance, mean and median income are both relevant, but they convey different insights. The mean reflects overall wealth, while the median provides a clearer picture of typical earnings, unaffected by extremely low or high incomes.

Given a cost function ρ, mirroring our expectations, we solve the following optimization problem to find the “best” representative value µ.

\[ \mu^* = \underset{\mu}{\arg\min} \; \sum_{i=1}^{n} \rho(x_i - \mu) \]

Mean

Let’s consider the case where ρ is the L2 loss:

\[ \mu^* = \underset{\mu}{\arg\min} \; \sum_{i=1}^{n} (x_i - \mu)^2 \]

Zeroing the gradient is straightforward and brings out the mean definition:

\[ \frac{\partial}{\partial \mu} \sum_{i=1}^{n} (x_i - \mu)^2 = -2 \sum_{i=1}^{n} (x_i - \mu) = 0 \quad \Longrightarrow \quad \mu^* = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Thus, we’ve shown that the mean best represents the xi in terms of the L2 loss.

Median

Let’s consider the L1 loss. Being a sum of piecewise linear functions, it is itself piecewise linear, with discontinuities in its gradient at each xi.

The figure below illustrates the L1 loss for each xi. Without loss of generality, I’ve sorted the xi to order the non-differentiable kinks. Each function |µ-xi| is xi-µ below xi and µ-xi above.

L1 loss between µ and each xi — Figure by the author

The table below clarifies the piecewise expressions of each individual L1 term |µ-xi|​. We can sum these expressions to get the total L1 loss. With the xi sorted, the leftmost part has a slope of -n and the rightmost a slope of +n.

For better readability, I’ve hidden the constant intercepts as Ci.

Piecewise definition table of each individual absolute function and their sum — Figure by the author

Intuitively, the minimum of this piecewise linear function occurs where the slope transitions from negative to positive, which is precisely where the median lies since the points are sorted.

Thus, we’ve shown that the median best represents the xi in terms of the L1 loss.

N.B. For an odd number of points, the median is the middle value and the unique minimizer of the L1 loss. For an even number of points, the median is the average of the two middle values, and the L1 loss forms a plateau, with any value between these two minimizing the loss.
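A tiny numerical check makes this tangible: scan candidate values of µ over a grid and compare the minimizers of the L1 and L2 losses with NumPy’s median and mean. The sample values below are arbitrary.

import numpy as np

x = np.array([1.0, 2.0, 2.5, 7.0, 40.0])       # arbitrary sample with one large value
mus = np.linspace(0, 50, 50001)                # candidate representative values

l2_loss = ((x[None, :] - mus[:, None]) ** 2).sum(axis=1)
l1_loss = np.abs(x[None, :] - mus[:, None]).sum(axis=1)

print("L2 minimizer:", mus[l2_loss.argmin()], "mean:", x.mean())        # both 10.5
print("L1 minimizer:", mus[l1_loss.argmin()], "median:", np.median(x))  # both 2.5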


Photo by Fauzan Saari on Unsplash

3. OLS is BLUE

Gauss-Markov theorem

The Gauss-Markov theorem states that the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE). “Best” means that OLS has the lowest variance among all linear unbiased estimators.

This sampling variance represents how much the estimate of the coefficients of β would vary across different samples drawn from the same population.

The theorem assumes Y follows a linear model with true linear coefficients β and random errors ε. That way, we can analyze how the β estimate of an estimator would vary for different values of noise ε.

The assumptions on the random errors ε ensure they are unbiased (zero mean), homoscedastic (constant finite variance), and uncorrelated (diagonal covariance matrix).

Linearity

Be aware that “linearity” in the Gauss-Markov theorem refers to two different concepts:

  • Model Linearity: The regression assumes a linear relationship between Y and X.
  • Estimator Linearity: We only consider estimators linear in Y, meaning they must include a linear component represented by a matrix C that depends only on X.

Unbiasedness of OLS

The OLS estimator, denoted with a hat, has already been derived earlier. Substituting the random error model for Y gives an expression that better captures the deviation from the true β.

We introduce the matrix A to represent the OLS-specific linear component C for better readability.

As expected, the OLS estimator is unbiased, as its expectation is centered around the true β for unbiased errors ε.

Theorem’s proof

Let’s consider a linear estimator, denoted by a tilde, with its linear component A+D, where D represents a shift from the OLS estimator.

The expected value of this linear estimator turns out to be the true β plus an additional term DXβ. For the estimator to be considered unbiased, this term must be zero, thus DX=0. This orthogonality ensures that the shift D does not introduce any bias.

Note that this also implies that DA'=0, which will be useful later.

Now that we’ve guaranteed the unbiasedness of our linear estimator, we can compare its variance against the OLS estimator.

Since the matrix C is constant and the errors ε are spherical, we obtain the following variance.

After substituting C with A+D, expanding the terms, and using the orthogonality of DA', we end up with the variance of our linear estimator being equal to a sum of two terms. The first term is the variance of the OLS estimator, and the second term is positive, due to the positive definiteness of DD’.
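Spelled out with the notation above, using \( A A^T = (X^T X)^{-1} \) and \( D A^T = 0 \), the comparison reads:

\[ \text{Var}(\tilde{\beta}) = \sigma^2 (A + D)(A + D)^T = \sigma^2 (X^T X)^{-1} + \sigma^2 D D^T = \text{Var}(\hat{\beta}) + \sigma^2 D D^T \]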

As a result, we have shown that the OLS estimator achieves the lowest variance among all linear estimators for Linear Regression with unbiased spherical errors.

Remarks

The OLS estimator is considered “best” in terms of minimum variance. However, it’s worth noting that the definition of the variance itself is closely tied to Least Squares, as it reflects the expectation of the squared difference from the expected value.

Thus, the key question would be why variance is typically defined this way.


Photo by Alperen Yazgı on Unsplash

4. LS is MLE with normal errors

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a method for estimating model parameters θ by maximizing the likelihood of observing the given data (x,y) under the model defined by θ.

Assuming the pairs (xi,yi) are independent and identically distributed (i.i.d.), we can express the likelihood as the product of the conditional probabilities.

A common trick consists in applying a logarithm on top of a product to transform it into a more convenient and numerically stable sum of logs. Since the logarithm is monotonically increasing, it’s still equivalent to solving the same optimization problem. That’s how we get the well known log-likelihood.

In numerical optimization, we usually add a minus sign to minimize quantities instead of maximizing them.

MLE Inference

Once the optimal model parameters θ have been estimated, inference is performed by finding the value of y that maximizes the conditional probability given the observed x, i.e. the most-likely y.

Model Parameters

Note that there’s no specific assumption on the model. It can be of any kind and its parameters are simply stacked into a flat vector θ.

For instance, θ can represent the weights of a neural network, the parameters of a random forest, the coefficients of a linear regression model, and so on.

Normal Errors

As for the errors around the true model, let’s assume that they are unbiased and normally distributed.

It’s equivalent to assuming that y follows a normal distribution with mean predicted by the model and fixed variance σ².

Note that the inference step is straightforward, because the peak of the normal distribution is reached at the mean, i.e. the value predicted by the model.

Interestingly, the exponential term in the normal density cancels out with the logarithm of the log-likelihood. It then turns out to be equivalent to a plain Least-Squares minimization problem!
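Concretely, plugging the normal density \( p(y_i \mid x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - f(x_i, \theta))^2}{2\sigma^2}\right) \) into the negative log-likelihood gives:

\[ -\log L(\theta) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i, \theta) \right)^2 \]

Since σ is fixed, the first term is a constant with respect to θ, so minimizing the negative log-likelihood is exactly the Least-Squares problem.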

As a result, using Least-Squares to fit any model, linear or not, is equivalent to Maximum Likelihood Estimation under normally distributed errors.


Photo by Brad Switzer on Unsplash

Conclusion

Fundamental Tool

In conclusion, the popularity of Least-Squares comes from its computational simplicity and its deep link to key statistical principles. It provides a closed form solution for Linear Regression (which is the Best Linear Unbiased Estimator), defines the mean, and is equivalent to Maximum Likelihood Estimation under normal errors.

BLUE or BUE ?

There’s even debate over whether or not the linearity assumption of the Gauss-Markov Theorem can be relaxed, allowing OLS to also be considered the Best Unbiased Estimator (BUE).

We’re still solving Linear Regression, but this time the estimator is no longer required to be linear in Y; it may also be non-linear, hence BUE instead of BLUE.

The economist Bruce Hansen thought he had proved it in 2022 [1], but Pötscher and Preinerstorfer quickly invalidated his proof [2].

Outliers

Least-Squares is very likely to become unreliable when errors are not normally distributed, e.g. with outliers.

As we’ve seen previously, the mean defined by L2 is highly affected by extreme values, whereas the median defined by L1 simply ignores them.

Robust loss functions like Huber or Tukey tend to still mimic the quadratic behavior of Least-Squares for small errors, while greatly attenuating the impact of large errors with a near L1 or constant behavior. They are much better choices than L2 to cope with outliers and provide robust estimates.

Regularization

In some cases, using a biased estimator like Ridge regression, which adds regularization, can improve generalization to unseen data. While introducing bias, it helps prevent overfitting, making the model more robust, especially in noisy or high-dimensional settings.


[1] Bruce E. Hansen, 2022. “A Modern Gauss–Markov Theorem,” Econometrica, Econometric Society, vol. 90(3), pages 1283–1294, May.

[2] Pötscher, Benedikt M. & Preinerstorfer, David, 2022. “A Modern Gauss-Markov Theorem? Really?,” MPRA Paper 112185, University Library of Munich, Germany.

The post Least Squares: Where Convenience Meets Optimality appeared first on Towards Data Science.

]]>
One Turn After Another https://towardsdatascience.com/one-turn-after-another/ Fri, 14 Mar 2025 16:54:43 +0000 https://towardsdatascience.com/?p=599564 Game theory 101: Dynamic games

The post One Turn After Another appeared first on Towards Data Science.

]]>
While some games, like rock-paper-scissors, only work if all players decide on their actions simultaneously, other games, like chess or Monopoly, expect the players to take turns one after another. In Game Theory, the first kind of game is called a static game, while turn-taking is a property of so-called dynamic games. In this article, we will analyse the latter with methods from game theory.

This article is the fourth part of a four-chapter series on the fundamentals of game theory. I recommend you to read the first three articles if you haven’t done that yet, as the concepts shown here will build on the terms and paradigms introduced in the previous articles. But if you are already familiar with the core fundamentals of game theory, don’t let yourself be stopped, and go ahead!

Dynamic games

Dynamic games can be visualized as trees. Photo by Adarsh Kummur on Unsplash

While so far we only looked at static games, we will now introduce dynamic games where players take turns. As previously, such games include a number of players n, a set of actions for each player, and a reward function that assesses the actions of a player given the other players’ actions. Beyond that, for a dynamic game, we need to define an order in which the players take their turns. Consider the following tree-like visualization of a dynamic game.

A visualization of a dynamic game. Figure by author.

At the top we have a node where player 1 has to decide between two actions L and R. This determines whether to follow the left part or the right part of the tree. After player 1’s turn, player 2 takes their turn. If player 1 chooses L, player 2 can decide between l1 and r1. If player 1 chooses R, player 2 has to decide between l2 and r2. At the leaves of the tree (the nodes at the bottom), we see the rewards just like we had them in the matrix cells in static games. For example, if player 1 decides for L and player 2 decides for r1, the reward is (1,0); that is, player 1 gets a reward of 1, and player 2 gets a reward of 0. 

I bet you are eager to find the Nash equilibrium of this game, as this is what Game Theory is mainly about (if you still struggle with the concept of Nash equilibrium, you might want to take a look back at chapter 2 of this series). To do that, we can transform the game into a matrix, as we already know how to find a Nash equilibrium in a game displayed as a matrix. Player 1 decides on the row of the matrix, player 2 decides on the column and the values in the cell then specifies the reward. However, there is one important point to notice. When we look at the game displayed as a tree, player 2 decides on their action after player 1 does and hence only cares about the part of the tree that is actually reached. If player 1 chooses action L, player 2 only decides between l1 and r1 and doesn’t care about l2 and r2, because these actions are out of the question anyway. However, when we search for a Nash Equilibrium, we need to be aware of what would happen, if player 1 would change their action. Therefore, we must know what player 2 would have done if player 1 had chosen a different option. That is why we have four columns in the following matrix, to always account for decisions in both parts of the tree. 

A column like (r1,l2) can be read as “player 2 chooses r1 if player 1 chose L and chooses l2 if player 1 chose R”. On this matrix, we can search for the best answers. For example, the cell (L, (l1,l2)) with reward 3,1 is a best answer. Player 1 has no reason to change from L to R because that would lower his reward (from 3 to 1), and Player 2 has no reason to change either because none of the other options is better (one is as good, though). In total, we find three Nash equilibria, which are underlined in the upcoming matrix: 

The chocolate-pudding market

We will talk about chocolate pudding now. But also about game theory. Photo by American Heritage Chocolate on Unsplash

Our next example brings the idea of dynamic games to life. Let’s assume player 2 is a market-leading retailer of chocolate pudding. Player 1 also wants to build up his business but isn’t sure yet whether to join the chocolate pudding market or whether they rather should sell something else. In our game, player 1 has the first turn and can decide between two actions. Join the market (i.e., sell chocolate pudding), or don’t join the market (i.e., sell something else). If player 1 decides to sell something other than chocolate pudding, player 2 stays the market-dominating retailer for chocolate pudding and player 1 makes some money in the other area they decided for. This is reflected by the reward 1,3 in the right part of the tree in the following figure. 

The market-game as a dynamic game. Figure by author. 

But what if player 1 is greedy for the unimaginable riches that lie dormant on the chocolate pudding market? If they decide to join the market, it is player 2’s turn. They can decide to accept the new competitor, give in and share the market. In this case, both players get a reward of 2. But player 2 can also decide to start a price war to demonstrate his superiority to the new competitor. In this case, both players get a reward of 0, because they ruin their profit due to dumping prices. 

Just like before, we can turn this tree into a matrix and find the Nash equilibria by searching for the best answers. Player 1 chooses the row, player 2 chooses the column, and player 1’s reward is listed first:

                 give in     fight
join             2, 2        0, 0
don't join       1, 3        1, 3

If player 1 joins the market, the best option for player 2 is to give in. This is an equilibrium because no player has any reason to change. For player 1 it does not make sense to leave the market (that would give a reward of 1 instead of 2), and for player 2 it is not a good idea to switch to fighting either (which would give a reward of 0 instead of 2). The other Nash equilibrium happens when player 1 just doesn’t join the market. However, this scenario includes player 2’s decision to fight if player 1 had chosen to join the market instead. Player 2 basically makes a threat and says “If you join the market, I will fight you.” Remember that previously we said we need to know what the players would do even in the cases that don’t appear to happen? Here we see why this is important. Player 1 needs to assume that player 2 would fight, because that is the only reason for player 1 to stay out of the market. If player 2 didn’t threaten to fight, we wouldn’t have a Nash equilibrium, because then joining the market would become a better option for player 1.

But how reasonable is this threat? It keeps player 1 outside the market, but what would happen if player 1 didn’t believe the threat and decided to still join the market? Would player 2 really carry out his threat and fight? That would be very silly because it would give him a reward of 0, whereas giving in would give a reward of 2. From that perspective, player 2 used an empty threat that is not very reasonable. If the case really occurs, he wouldn’t carry it out anyway, would he?

Subgame perfect equilibrium

For a subgame perfect equilibrium, before you get the whole picture, you need to start with small parts of the game. Photo by Ben Stern on Unsplash

The previous example showed that sometimes Nash equilibria occur, that are not very reasonable within the game. To cope with this problem, a more strict concept of equilibrium has been introduced which is called a subgame perfect equilibrium. This adds some stricter conditions to the notion of an equilibrium. Hence every subgame perfect equilibrium is a Nash equilibrium, but not all Nash equilibria are subgame perfect. 

A Nash equilibrium is subgame perfect if every subgame of this equilibrium is a Nash equilibrium itself. What does that mean? First, we have to understand that a subgame is a part of the game’s tree that starts at any node. For example, if player 1 chooses L, the remainder of the tree under the node reached by playing L is a subgame. In a likewise fashion, the tree that comes after the node of action R is a subgame. Last but not least, the whole game is always a subgame of itself. As a consequence, the example we started with has three subgames, which are marked in grey, orange and blue in the following: 

The market game has three subgames. Figure by author.

We already saw that this game has three Nash equilibria which are (L,(l1,l2)), (L, (l1,r2)) and (R,(r1,r2)). Let us now find out which of these are subgame perfect. To this end, we investigate the subgames one after another, starting with the orange one. If we only look at the orange part of the tree, there is a single Nash equilibrium that occurs if player 2 chooses l1. If we look at the blue subgame, there is also a single Nash equilibrium that is reached when player 2 chooses r2. Now that tells us that in every subgame perfect Nash equilibrium, player 2 has to choose option l1 if we arrive in the orange subgame (i.e. if player 1 chooses L) and player 2 has to choose option r2 if we arrive at the blue subgame (i.e., if player 1 chooses R). Only one of the previous Nash equilibria fulfills this condition, namely (L,(l1,r2)). Hence this is the only subgame perfect Nash equilibrium of the whole game. The other two versions are Nash equilibria as well, but they are somewhat illogical in the sense that they contain some kind of empty threat, as we had it in the chocolate pudding market example before. The method we just used to find the subgame perfect Nash equilibrium is called backwards induction, by the way.
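To see backwards induction as an algorithm, here is a small, self-contained Python sketch that solves the chocolate-pudding market game from the previous section by working from the leaves upwards. The tree structure and payoffs are the ones given there; the node representation is only a convenient assumption for this illustration.

# Each node is either a leaf with payoffs (reward_player1, reward_player2)
# or a decision node: (player, {action: child_node}).
market_game = (1, {
    "join": (2, {
        "give in": (2, 2),
        "fight": (0, 0),
    }),
    "don't join": (1, 3),
})

def backwards_induction(node):
    """Return (payoffs, strategy) of the subgame rooted at this node."""
    if not isinstance(node[1], dict):          # leaf: just the payoffs
        return node, {}
    player, actions = node
    best_action, best_payoffs, strategy = None, None, {}
    for action, child in actions.items():
        payoffs, sub_strategy = backwards_induction(child)
        strategy.update(sub_strategy)
        # The player at this node picks the action maximizing their own payoff
        if best_payoffs is None or payoffs[player - 1] > best_payoffs[player - 1]:
            best_action, best_payoffs = action, payoffs
    strategy[(player, tuple(actions))] = best_action
    return best_payoffs, strategy

payoffs, strategy = backwards_induction(market_game)
print("Subgame perfect outcome:", payoffs)     # (2, 2): player 1 joins, player 2 gives in
print("Strategy:", strategy)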

Uncertainty

In dynamic games, it can happen that you have to make decisions without knowing exactly what node of the game you are in. Photo by Denise Jans on Unsplash

So far in our dynamic games, we always knew which decisions the other players made. For a game like chess, this is the case indeed, as every move your opponent makes is perfectly observable. However, there are other situations in which you might not be sure about the exact moves the other players make. As an example, we go back to the chocolate pudding market. You take the perspective of the retailer that is already in the market and you have to decide whether you would start fighting if the other player joins the market. But there is one thing you don’t know, namely how aggressive your opponent will be. When you start fighting, will they be frightened easily and give up? Or will they be aggressive and fight you until only one of you is left? This can be seen as a decision made by the other player that influences your decision. If you expect the other player to be a coward, you might prefer to fight, but if they turn out to be aggressive, you would rather want to give in (reminds you of the birds fighting for food in the previous chapter, doesn’t it?). We can model this scenario in a game like this: 

A dynamic game with a hidden decision (indicated by the dotted circle). Figure by author.

The dotted circle around the two nodes indicates that these are hidden decisions that are not observable to everyone. If you are player 2, you know whether player 1 joined the market or not, but if they joined, you don’t know whether they are aggressive (left node) or moderate (right node). Hence you act under uncertainty, which is a very common ingredient in many games you play in the real world. Poker would become very boring if everybody knew everyone’s cards; that’s why there is private information, namely the cards in your hand that only you know about.

Now you still have to decide whether to fight or give in, although you are not exactly sure what node of the tree you are in. To do that, you have to make assumptions about the likelihood of each state. If you are quite certain that the other player is behaving moderately, you might be up for a fight, but if you assume them to be aggressive, you might prefer giving in. Say there is a probability p that the other player is aggressive and 1-p that they behave moderately. If you assume p to be high, you should give in, but if p becomes smaller, there should be a point where your decision switches to fighting. Let’s try to find that point. In particular, there should be a sweet spot in between where the probability of the other player being aggressive vs. moderate is such that fighting and giving in are equal alternatives to one another. That is, the rewards would be equal, which we can model as follows:
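Schematically, writing \( u_2(\cdot, \cdot) \) for player 2’s reward at each leaf of the tree above, the indifference condition between fighting and giving in is:

\[ p \cdot u_2(\text{aggressive}, \text{fight}) + (1-p) \cdot u_2(\text{moderate}, \text{fight}) = p \cdot u_2(\text{aggressive}, \text{give in}) + (1-p) \cdot u_2(\text{moderate}, \text{give in}) \]

Plugging in the concrete rewards from the figure is what makes this solve to p = 1/3 below.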

Do you see how this formula is derived from the rewards for fighting or giving in at the different leaves of the tree? This formula solves to p=1/3, so if the probability of the other player being aggressive is 1/3, it makes no difference whether you fight or give in. But if you assume the other player to be aggressive with a probability of more than 1/3, you should give in, and if you assume aggressiveness to be less likely than 1/3, you should fight. This is a chain of thought you also have in other games where you act under uncertainty. When you play poker, you might not calculate the probabilities exactly, but you ask yourself, "How likely is it that John has two kings in his hand?" and depending on your assumption of that probability, you check, raise or fold. 
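If you want to verify such an indifference point yourself, here is a small Python sketch. The payoffs are illustrative assumptions, not the values from the game tree above: fighting is assumed to yield -1 against an aggressive entrant and 2 against a moderate one, while giving in yields 1 either way. These numbers were chosen only so that the indifference point comes out at p = 1/3.

import sympy as sp

p = sp.symbols("p", positive=True)

# Illustrative payoffs (assumed, not taken from the figure):
# fighting: -1 if the entrant is aggressive (prob. p), 2 if moderate (prob. 1-p)
# giving in: 1 regardless of the entrant's type
expected_fight = p * (-1) + (1 - p) * 2
expected_give_in = 1

print(sp.solve(sp.Eq(expected_fight, expected_give_in), p))  # [1/3]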

Summary & outlook

Your journey on the seas of game theory has only just begun. There is so much more to explore. Photo by George Liapis on Unsplash

Now we have learned a lot about dynamic games. Let us summarize our key findings. 

  • Dynamic games include an order in which players take turns. 
  • In dynamic games, the players’ possible actions depend on the previously executed actions of the other players. 
  • A Nash equilibrium in a dynamic game can be implausible, as it contains an empty threat that would not be rational.
  • The concept of subgame perfect equilibria prevents such implausible solutions. 
  • In dynamic games, decisions can be hidden. In that case, players may not know exactly which node of the game they are in and have to assign probabilities to the different states of the game. 

With that, we have reached the end of our series on the fundamentals of game theory. We have learned a lot, yet there are plenty of things we haven’t been able to cover. Game theory is a science in itself, and we have only been able to scratch the surface. Other concepts that expand the possibilities of game-theoretic analyses include: 

  • Analysing games that are repeated multiple times. If you play the prisoner’s dilemma multiple times, you might be tempted to punish the other player for having betrayed you in the previous round. 
  • In cooperative games, players can conclude binding contracts that determine their actions to reach a solution of the game together. This is different from the non-cooperative games we looked at, where all players are free to decide and maximize their own reward. 
  • While we only looked at discrete games, where each player has a finite number of actions to choose from, continuous games allow an infinite number of actions (e.g., any number between 0 and 1). 
  • A big part of game theory considers the usage of public goods and the problem that individuals might consume these goods without contributing to their maintenance. 

These concepts allow us to analyse real-world scenarios from various fields such as auctions, social networks, evolution, markets, information sharing, voting behaviour and much more. I hope you enjoyed this series and find meaningful applications for the knowledge you gained, be it the analysis of customer behaviour, political negotiations or the next game night with your friends. From a game theory perspective, life is a game!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in the English language could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

The post One Turn After Another appeared first on Towards Data Science.

]]>
When You Just Can’t Decide on a Single Action https://towardsdatascience.com/when-you-just-cant-decide-on-a-single-action/ Fri, 07 Mar 2025 18:01:37 +0000 https://towardsdatascience.com/?p=599290 Game Theory 101: Mixing strategies

The post When You Just Can’t Decide on a Single Action appeared first on Towards Data Science.

]]>
In Game Theory, the players typically have to make assumptions about the other players’ actions. What will the other player do? Will he use rock, paper or scissors? You never know, but in some cases, you might have an idea of the probability of some actions being higher than others. Adding such a notion of probability or randomness opens up a new chapter in game theory that lets us analyse more complicated scenarios. 

This article is the third in a four-chapter series on the fundamentals of game theory. If you haven’t checked out the first two chapters yet, I’d encourage you to do that to become familiar with the basic terms and concepts used in the following. If you feel ready, let’s go ahead!

Mixed Strategies

To the best of my knowledge, soccer is all about hitting the goal, although that happens very infrequently. Photo by Zainu Color on Unsplash

So far we have always considered games where each player chooses exactly one action. Now we will extend our games by allowing each player to select different actions with given probabilities, which we call a mixed strategy. If you play rock-paper-scissors, you do not know which action your opponent takes, but you might guess that they select each action with a probability of 33%, and if you play 99 games of rock-paper-scissors, you might indeed find your opponent to choose each action roughly 33 times. With this example, you directly see the main reasons why we want to introduce probability. First, it allows us to describe games that are played multiple times, and second, it enables us to consider a notion of the (assumed) likelihood of a player’s actions. 

Let me demonstrate the latter point in more detail. We come back to the soccer game we saw in chapter 2, where the keeper decides on a corner to jump into and the other player decides on a corner to aim for.

A game matrix for a penalty kick.

If you are the keeper, you win (reward of 1) if you choose the same corner as the opponent and you lose (reward of -1) if you choose the other one. For your opponent, it is the other way round: they win if you select different corners. This game only makes sense if both the keeper and the opponent select a corner randomly. To be precise, if one player knows that the other always selects the same corner, they know exactly what to do to win. So, the key to success in this game is to choose the corner by some random mechanism. The main question now is, what probability should the keeper and the opponent assign to each corner? Would it be a good strategy to choose the right corner with a probability of 80%? Probably not. 

To find the best strategy, we need to find the Nash equilibrium, because that is the state where no player can do any better by changing their behaviour. In the case of mixed strategies, such a Nash equilibrium is described by a probability distribution over the actions, where no player wants to increase or decrease any probability anymore. In other words, it is optimal (because if it were not optimal, one player would want to change). We can find this optimal probability distribution if we consider the expected reward. As you might guess, the expected reward is composed of the reward (also called utility) the players get (which is given in the matrix above) times the likelihood of that reward. Let's say the shooter chooses the left corner with probability p and the right corner with probability 1-p. What reward can the keeper expect? Well, if they choose the left corner, they can expect a reward of p*1 + (1-p)*(-1). Do you see how this is derived from the game matrix? If the keeper chooses the left corner, there is a probability of p that the shooter chooses the same corner, which is good for the keeper (reward of 1). But with a chance of (1-p), the shooter chooses the other corner and the keeper loses (reward of -1). Likewise, if the keeper chooses the right corner, they can expect a reward of (1-p)*1 + p*(-1). Consequently, if the keeper chooses the left corner with probability q and the right corner with probability (1-q), the overall expected reward for the keeper is q times the expected reward for the left corner plus (1-q) times the expected reward for the right corner. 

Now let's take the perspective of the shooter. They want the keeper to be indifferent between the corners. In other words, they want the keeper to see no advantage in either corner, so that the keeper chooses randomly. Mathematically that means that the keeper's expected rewards for both corners should be equal, i.e.

\[ p \cdot 1 + (1-p) \cdot (-1) = (1-p) \cdot 1 + p \cdot (-1) \]

which can be solved to p=0.5. So the optimal strategy for the shooter to keep the keeper indifferent is to choose the left corner with a probability of p=0.5 and hence choose the right corner with an equal probability of 0.5. 
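Here is a quick numeric check of this reasoning in Python, using the payoffs from the matrix above (+1 if keeper and shooter pick the same corner, -1 otherwise):

# p is the probability that the shooter aims at the left corner
def keeper_expected_rewards(p):
    left = p * 1 + (1 - p) * (-1)    # keeper jumps left
    right = (1 - p) * 1 + p * (-1)   # keeper jumps right
    return left, right

for p in [0.3, 0.5, 0.7]:
    left, right = keeper_expected_rewards(p)
    print(f"p={p}: jump left -> {left:+.1f}, jump right -> {right:+.1f}")

# Only at p=0.5 are both corners equally good for the keeper,
# so that is the shooter's equilibrium mix.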

But now imagine a shooter who is well known for their tendency to choose the right corner. You might not expect a 50/50 probability for each corner, but instead assume they will choose the right corner with a probability of 70%. If the keeper stays with their 50/50 split for choosing a corner, their expected reward is 0.5 times the expected reward for the left corner plus 0.5 times the expected reward for the right corner:

\[ 0.5 \cdot \big(0.3 \cdot 1 + 0.7 \cdot (-1)\big) + 0.5 \cdot \big(0.7 \cdot 1 + 0.3 \cdot (-1)\big) = 0.5 \cdot (-0.4) + 0.5 \cdot 0.4 = 0 \]

That does not sound too bad, but there is a better option still. If the keeper always chooses the right corner (i.e., q=1), they get a reward of 0.4, which is better than 0. In this case, there is a clear best answer for the keeper, which is to favour the corner the shooter prefers. That, however, would lower the shooter's reward. If the keeper always chooses the right corner, the shooter would get a reward of -1 with a probability of 70% (because the shooter chooses the right corner with a probability of 70%) and a reward of 1 in the remaining 30% of cases, which yields an expected reward of 0.7*(-1) + 0.3*1 = -0.4. That is worse than the reward of 0 they got when they chose 50/50. Do you remember that a Nash equilibrium is a state where no player has any reason to change their action unless another player does? This scenario is not a Nash equilibrium, because the shooter has an incentive to move back towards a 50/50 split, even if the keeper does not change their strategy. The 50/50 split, however, is a Nash equilibrium, because in that scenario neither the shooter nor the keeper gains anything from changing their probability of choosing one or the other corner. 
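The same check can be done numerically. The snippet below is a self-contained sketch of the 70/30 scenario just described:

# Shooter aims right with probability 0.7, i.e. left with probability 0.3
p_left = 0.3

keeper_left = p_left * 1 + (1 - p_left) * (-1)   # keeper jumps left
keeper_right = (1 - p_left) * 1 + p_left * (-1)  # keeper jumps right
print(keeper_left, keeper_right)  # -0.4 vs 0.4 -> always jumping right is best

# If the keeper always jumps right, the shooter's expected reward is:
shooter_reward = 0.7 * (-1) + 0.3 * 1  # caught in 70% of the shots
print(shooter_reward)  # -0.4, worse than the 0 earned under a 50/50 mix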

Fighting birds

Food can be a reason for birds to fight each other. Photo by Viktor Keri on Unsplash

From the previous example we saw that a player's assumptions about the other player's actions influence the first player's action selection as well. If a player wants to behave rationally (and this is what we always expect in game theory), they will choose actions that maximize their expected reward given the other players' mixed strategies. In the soccer scenario it is quite simple to jump into a corner more often if you assume that the opponent will choose that corner more often, so let us continue with a more complicated example that takes us outside into nature. 

As we walk across the forest, we notice some interesting behaviour in wild animals. Say two birds come to a place where there is something to eat. If you were a bird, what would you do? Would you share the food with the other bird, which means less food for both of you? Or would you fight? If you threaten your opponent, they might give in and you have all the food for yourself. But if the other bird is as aggressive as you, you end up in a real fight and you hurt each other. Then again you might have preferred to give in in the first place and just leave without a fight. As you see, the outcome of your action depends on the other bird. Preparing to fight can be very rewarding if the opponent gives in, but very costly if the other bird is willing to fight as well. In matrix notation, this game looks like this:

A matrix for a game that is sometimes called hawk vs. dove.

The question is, what would be the rational behaviour for a given distribution of birds who fight or give in? If you are in a very dangerous environment, where most birds are known to be aggressive fighters, you might prefer giving in to avoid getting hurt. But if you assume that most other birds are cowards, you might see a potential benefit in preparing for a fight to scare the others away. By calculating the expected reward, we can figure out the exact proportions of fighting birds and yielding birds that form an equilibrium. Say the probability to fight is denoted p for bird 1 and q for bird 2; then the probability of giving in is 1-p for bird 1 and 1-q for bird 2. In a Nash equilibrium, no player wants to change their strategy unless another player does. Formally that means that both options need to yield the same expected reward. So, for bird 2, fighting with a probability of q needs to be as good as giving in with a probability of (1-q). This leads us to the following formula we can solve for q:

For bird 2 it would be optimal to fight with a probability of 1/3 and give in with a probability of 2/3, and the same holds for bird 1 because of the symmetry of the game. In a big population of birds, that would mean that a third of the birds are fighters, who usually seek the fight, while the other two-thirds prefer giving in. As this is an equilibrium, these ratios will stay stable over time. If more birds became cowards who always give in, fighting would become more rewarding, as the chance of winning would increase. Then, however, more birds would choose to fight, the number of cowardly birds would decrease, and the stable equilibrium would be reached again. 
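To reproduce this kind of calculation, here is a small sketch with illustrative payoffs. The actual values are in the matrix figure above; the numbers below are assumptions chosen only so that the equilibrium probability of fighting comes out at 1/3. We set a bird's expected reward for fighting equal to its expected reward for giving in, given that the opponent fights with probability q:

import sympy as sp

q = sp.symbols("q", positive=True)

# Illustrative hawk-vs-dove payoffs for bird 1 (assumed, not taken from the figure):
#                  bird 2 fights   bird 2 gives in
# bird 1 fights        -2                2
# bird 1 gives in       0                1
expected_fight = q * (-2) + (1 - q) * 2
expected_give_in = q * 0 + (1 - q) * 1

print(sp.solve(sp.Eq(expected_fight, expected_give_in), q))  # [1/3]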

Report a crime

There is nothing to see here. Move on and learn more about game theory. Photo by JOSHUA COLEMAN on Unsplash

Now that we have understood that we can find optimal Nash equilibria by comparing the expected rewards for the different options, we will use this strategy on a more sophisticated example to unleash the power game theory analyses can have for realistic complex scenarios. 

Say a crime happened in the middle of the city centre and there are multiple witnesses to it. The question is, who calls the police now? As there are many people around, everybody might expect others to call the police and hence refrain from doing it themselves. We can model this scenario as a game again. Let's say we have n players and everybody has two options, namely calling the police or not calling it. And what is the reward? For the reward, we distinguish three cases. If nobody calls the police, the reward is zero, because then the crime is not reported. If you call the police, you have some costs (e.g. the time you have to spend to wait and tell the police what happened), but the crime is reported, which helps keep your city safe. If somebody else reports the crime, the city would still be kept safe, but you didn't have the costs of calling the police yourself. Formally, we can write this down as follows:

\[ r_i = \begin{cases} v & \text{if somebody else calls the police} \\ v - c & \text{if you call the police yourself} \\ 0 & \text{if nobody calls the police} \end{cases} \]

v is the reward of keeping the city safe, which you get either if somebody else calls the police (first row) or if you call the police yourself (second row). In the second case, however, your reward is diminished a little by the costs c you incur. Let us assume that c is smaller than v, which means that the costs of calling the police never exceed the reward you get from keeping your city safe. In the last case, where nobody calls the police, your reward is zero.

This game looks a little different from the previous ones we had, mainly because we didn’t display it as a matrix. In fact, it is more complicated. We didn’t specify the exact number of players (we just called it n), and we also didn’t specify the rewards explicitly but just introduced some values v and c. However, this helps us model a quite complicated real situation as a game and will allow us to answer more interesting questions: First, what happens if more people witness the crime? Will it become more likely that somebody will report the crime? Second, how do the costs c influence the likelihood of the crime being reported? We can answer these questions with the game-theoretic concepts we have learned already. 

As in the previous examples, we will use the Nash equilibrium's property that in an optimal state, nobody should want to change their action. That means that, for every individual, calling the police should be as good as not calling it, which leads us to the following formula:

\[ v - c = v \cdot \left(1 - (1-p)^{\,n-1}\right) \]

On the left, you have the reward if you call the police yourself (v-c). This should be as good as a reward of v times the likelihood that anybody else calls the police. Now, the probability of anybody else calling the police is the same as 1 minus the probability that nobody else calls the police. If we denote the probability that an individual calls the police with p, the probability that a single individual does not call the police is 1-p. Consequently, the probability that two individuals don't call the police is the product of the single probabilities, (1-p)*(1-p). For n-1 individuals (all individuals except you), this gives us the term 1-p to the power of n-1 in the last row. We can solve this equation and finally arrive at:

\[ p = 1 - \left(\frac{c}{v}\right)^{\frac{1}{n-1}} \]

This last row gives you the probability of a single individual calling the police. What happens if there are more witnesses to the crime? If n gets larger, the exponent becomes smaller (1/(n-1) goes towards 0), which finally leads to:

\[ p \;\to\; 1 - \left(\frac{c}{v}\right)^{0} \]

Given that x to the power of 0 is always 1, p becomes zero. In other words, the more witnesses are around (higher n), the less likely it becomes that you call the police, and for an infinite number of other witnesses, the probability drops to zero. This sounds reasonable: the more people are around, the more you expect somebody else to call the police, and the smaller you perceive your own responsibility. Naturally, all other individuals will have the same chain of thought. But that also sounds a little tragic, doesn't it? Does this mean that nobody will call the police if there are many witnesses? 

Well, not necessarily. We just saw that the probability of a single person calling the police declines with higher n, but there are still more people around. Maybe the sheer number of people around counteracts this diminishing probability. A hundred people with a small probability of calling the police each might still be worth more than a few people with moderate individual probabilities. Let us now take a look at the probability that anybody calls the police.

The probability that anybody calls the police is equal to 1 minus the probability that nobody calls the police. As before, the probability of nobody calling the police is described by 1-p to the power of n. We then use the equation we derived previously (see the formulas above) to replace (1-p)^(n-1) with c/v:

\[ P(\text{anybody calls}) = 1 - (1-p)^{n} = 1 - (1-p)\cdot(1-p)^{n-1} = 1 - (1-p)\cdot\frac{c}{v} \]

When we look at the last line of our calculations, what happens for big n now? We already know that p drops to zero, leaving us with a probability of 1-c/v. This is the likelihood that anybody will call the police if there are many people around (note that this is different from the probability that a single individual calls the police). We see that this likelihood heavily depends on the ratio of c and v. The smaller c, the more likely it is that anybody calls the police. If c is (close to) zero, it is almost certain that the police will be called, but if c is almost as big as v (that is, the costs of calling the police eat up the reward of reporting the crime), it becomes unlikely that anybody calls the police. This gives us a lever to influence the probability of reporting crimes. Calling the police and reporting a crime should be as effortless and low-threshold as possible.
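A quick numeric check of both formulas makes this behaviour visible. The reward v = 1 and cost c = 0.2 below are illustrative values only:

# Numeric check of the crime-reporting equilibrium derived above
def individual_prob(n, c, v):
    """Probability that a single witness calls the police (requires n >= 2)."""
    return 1 - (c / v) ** (1 / (n - 1))

def anybody_prob(n, c, v):
    """Probability that at least one of the n witnesses calls the police."""
    p = individual_prob(n, c, v)
    return 1 - (1 - p) ** n

v, c = 1.0, 0.2  # illustrative values with c < v
for n in [2, 5, 20, 100, 1000]:
    print(f"n={n:4d}: individual p={individual_prob(n, c, v):.3f}, "
          f"anybody calls={anybody_prob(n, c, v):.3f}")

# The individual probability falls towards 0, while the probability that
# anybody calls approaches 1 - c/v = 0.8 as n grows.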

Summary

We have learned a lot about probabilities and choosing actions randomly today. Photo by Robert Stump on Unsplash

In this chapter on our journey through the realms of game theory, we have introduced so-called mixed strategies, which allowed us to describe games by the probabilities with which different actions are taken. We can summarize our key findings as follows: 

  • A mixed strategy is described by a probability distribution over the different actions.
  • In a Nash equilibrium, the expected reward for all actions a player can take must be equal.
  • In mixed strategies, a Nash equilibrium means that no player wants to change the probabilities of their actions.
  • We can find out the probabilities of different actions in a Nash equilibrium by setting the expected rewards of two (or more) options equal.
  • Game-theoretic concepts allow us to analyze scenarios with an infinite number of players. Such analyses can also tell us how the exact shaping of the reward can influence the probabilities in a Nash equilibrium. This can be used to inspire decisions in the real world, as we saw in the crime reporting example.

We are almost through with our series on the fundamentals of game theory. In the next and final chapter, we will introduce the idea of taking turns in games. Stay tuned!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in the English language could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

The post When You Just Can’t Decide on a Single Action appeared first on Towards Data Science.

]]>
How to Spot and Prevent Model Drift Before it Impacts Your Business https://towardsdatascience.com/how-to-spot-and-prevent-model-drift-before-it-impacts-your-business/ Thu, 06 Mar 2025 19:22:22 +0000 https://towardsdatascience.com/?p=598826 3 essential methods to track model drift you should know

The post How to Spot and Prevent Model Drift Before it Impacts Your Business appeared first on Towards Data Science.

]]>
Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected Model Drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this comprehensive guide, I’ll provide a framework on how to think about and implement effective Model Monitoring, helping you stay ahead of potential issues and ensure stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in the distribution. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution.

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used population stability index (PSI) to measure how much model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. It compares the differences in proportions and their logarithmic ratios to compute a single summary statistic to quantify the drift.

Python implementation:

import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
  # Discretize scores into bins
  min_val, max_val = 0, 1
  bin_edges = np.linspace(min_val, max_val, bins + 1)

  # Calculate proportions in each bin
  ref_counts, _ = np.histogram(reference, bins=bin_edges)
  prod_counts, _ = np.histogram(production, bins=bin_edges)

  ref_proportions = ref_counts / len(reference)
  prod_proportions = prod_counts / len(production)
  
  # Avoid division by zero
  ref_proportions = np.clip(ref_proportions, 1e-8, 1)
  prod_proportions = np.clip(prod_proportions, 1e-8, 1)

  # Calculate PSI for each bin
  psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

  return psi
  
# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features, because it is non-parametric, meaning it doesn’t assume a normal distribution.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.

Python implementation:

import pandas as pd
from scipy.stats import ks_2samp

# Create an empty dataframe
ks_results = pd.DataFrame(columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

# Loop through all numeric features and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results in the dataframe
    ks_results = pd.concat([
        ks_results,
        pd.DataFrame({
            'Feature': [col],
            'KS Statistic': [ks_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as it has a larger proportion of observations with lower account ages.

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.

Python implementation:

from scipy.stats import chi2_contingency

# Create empty dataframe with corresponding column names
chi2_results = pd.DataFrame(columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

for col in categorical_cols:
    # Get normalized value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts(normalize=True)
    prod_counts = prod_data[col].value_counts(normalize=True)

    # Ensure all categories are represented in both
    all_categories = set(ref_counts.index).union(set(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table
    contingency_table = np.array([ref_counts * len(ref_data), prod_counts * len(prod_data)])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results in chi2_results dataframe
    chi2_results = pd.concat([
        chi2_results,
        pd.DataFrame({
            'Feature': [col],
            'Chi-Square Statistic': [chi2_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating a weak relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

# Imports required by this snippet (Keras is assumed to be used via TensorFlow)
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3 
# Input layer
input_layer = Input(shape=(input_dim, ))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate reconstruction error
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)

The charts below show the distribution of reconstruction loss between both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than that of the reference dataset, indicating a shift in the overall data. This aligns with the changes in the production dataset with a higher number of newer accounts with high-value transactions.

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like daylight saving time adjustments.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there's significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, career advice for professionals in data.



The post How to Spot and Prevent Model Drift Before it Impacts Your Business appeared first on Towards Data Science.

]]>
One-Tailed Vs. Two-Tailed Tests https://towardsdatascience.com/one-tailed-vs-two-tailed-tests/ Thu, 06 Mar 2025 04:22:42 +0000 https://towardsdatascience.com/?p=598815 Choosing between one- and two-tailed hypotheses affects every stage of A/B testing. Learn why the hypothesis direction matters and explore the pros and cons of each approach.

The post One-Tailed Vs. Two-Tailed Tests appeared first on Towards Data Science.

]]>
Introduction

If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no—or if you’re not even sure what this means—then this blog post is for you!

The alternative hypothesis parameter, commonly referred to as “one-tailed” versus “two-tailed” in statistics, defines the expected direction of the difference between control and treatment groups. In a two-tailed test, we assess whether there is any difference in mean values between the groups, without specifying a direction. A one-tailed test, on the other hand, posits a specific direction—whether the control group’s mean is either less than or greater than that of the treatment group.
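If you have never touched that parameter, here is what it looks like in SciPy's independent-samples t-test; the data below is simulated purely for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=200)    # simulated control group
treatment = rng.normal(loc=10.4, scale=2.0, size=200)  # simulated treatment group

# Default: two-tailed test (a difference in either direction)
print(stats.ttest_ind(treatment, control, alternative='two-sided').pvalue)

# One-tailed test: the treatment mean is hypothesized to be greater than the control mean
print(stats.ttest_ind(treatment, control, alternative='greater').pvalue)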

Choosing between one- and two-tailed hypotheses might seem like a minor detail, but it affects every stage of A/B testing: from test planning to data analysis and results interpretation. This article builds a theoretical foundation on why the hypothesis direction matters and explores the pros and cons of each approach.

One-tailed vs. two-tailed hypothesis testing: Understanding the difference

To understand the importance of choosing between one-tailed and two-tailed hypotheses, let's briefly review the basics of the t-test, a commonly used method in A/B testing. Like other hypothesis testing methods, the t-test begins with a conservative assumption: there is no difference between the two groups (the null hypothesis). Only if we find strong evidence against this assumption can we reject the null hypothesis and conclude that the treatment has had an effect.

But what qualifies as “strong evidence”? To that end, a rejection region is determined under the null hypothesis and all results that fall within this region are deemed so unlikely that we take them as evidence against the feasibility of the null hypothesis. The size of this rejection region is based on a predetermined probability, known as alpha (α), which represents the likelihood of incorrectly rejecting the null hypothesis. 

What does this have to do with the direction of the alternative hypothesis? Quite a bit, actually. While the alpha level determines the size of the rejection region, the alternative hypothesis dictates its placement. In a one-tailed test, where we hypothesize a specific direction of difference, the rejection region is situated in only one tail of the distribution. For a hypothesized positive effect (e.g., that the treatment group mean is higher than the control group mean), the rejection region would lie in the right tail, creating a right-tailed test. Conversely, if we hypothesize a negative effect (e.g., that the treatment group mean is less than the control group mean), the rejection region would be placed in the left tail, resulting in a left-tailed test.

In contrast, a two-tailed test allows for the detection of a difference in either direction, so the rejection region is split between both tails of the distribution. This accommodates the possibility of observing extreme values in either direction, whether the effect is positive or negative.

To build intuition, let's visualize how the rejection regions appear under the different hypotheses. Recall that according to the null hypothesis, the difference between the two groups should center around zero. Thanks to the central limit theorem, we also know this distribution approximates a normal distribution. Consequently, the rejection regions corresponding to the different alternative hypotheses look like this:

Why does it make a difference?

The choice of direction for the alternative hypothesis impacts the entire A/B testing process, starting with the planning phase—specifically, in determining the sample size. Sample size is calculated based on the desired power of the test, which is the probability of detecting a true difference between the two groups when one exists. To compute power, we examine the area under the sampling distribution implied by the alternative hypothesis that falls within the rejection region (since power reflects the ability to reject the null hypothesis when the alternative hypothesis is true).

Since the direction of the hypothesis affects the size of this rejection region, power is generally lower for a two-tailed hypothesis. This is due to the rejection region being divided across both tails, making it more challenging to detect an effect in any one direction. The following graph illustrates the comparison between the two types of hypotheses. Note that the purple area is larger for the one-tailed hypothesis, compared to the two-tailed hypothesis:

In practice, to maintain the desired power level, we compensate for the reduced power of a two-tailed hypothesis by increasing the sample size (increasing the sample size raises power, though the mechanics of this can be a topic for a separate article). Thus, the choice between one- and two-tailed hypotheses directly influences the required sample size for your test. 
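To get a feel for the magnitude of this effect, here is a hedged sketch using statsmodels' power calculator. The effect size, power, and alpha below are illustrative assumptions, and the exact sample sizes will differ for your own metrics:

from statsmodels.stats.power import tt_ind_solve_power

# Required sample size per group to detect a small effect (Cohen's d = 0.2)
# with 80% power and alpha = 5%
n_two_tailed = tt_ind_solve_power(effect_size=0.2, alpha=0.05, power=0.8,
                                  alternative='two-sided')
n_one_tailed = tt_ind_solve_power(effect_size=0.2, alpha=0.05, power=0.8,
                                  alternative='larger')

print(round(n_two_tailed), round(n_one_tailed))  # roughly 394 vs 310 per group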

Beyond the planning phase, the choice of alternative hypothesis directly impacts the analysis and interpretation of results. There are cases where a test may reach significance with a one-tailed approach but not with a two-tailed one, and vice versa. Reviewing the previous graph can help illustrate this: for example, a result in the left tail might be significant under a two-tailed hypothesis but not under a right one-tailed hypothesis. Conversely, certain results might fall within the rejection region of a right one-tailed test but lie outside the rejection area in a two-tailed test.

How to decide between a one-tailed and two-tailed hypothesis

Let’s start with the bottom line: there’s no absolute right or wrong choice here. Both approaches are valid, and the primary consideration should be your specific business needs. To help you decide which option best suits your company, we’ll outline the key pros and cons of each.

At first glance, a one-tailed alternative may appear to be the clear choice, as it often aligns better with business objectives. In industry applications, the focus is typically on improving specific metrics rather than exploring a treatment's impact in both directions. This is especially relevant in A/B testing, where the goal is often to optimize conversion rates or enhance revenue. If the treatment doesn't lead to a significant improvement, the examined change won't be implemented.

Beyond this conceptual advantage, we have already mentioned one key benefit of a one-tailed hypothesis: it requires a smaller sample size. Thus, choosing a one-tailed alternative can save both time and resources. To illustrate this advantage, the following graphs show the required sample sizes for one- and two-tailed hypotheses with different power levels (alpha is set at 5%).

In this context, the decision between one- and two-tailed hypotheses becomes particularly important in sequential testing—a method that allows for ongoing data analysis without inflating the alpha level. Here, selecting a one-tailed test can significantly reduce the duration of the test, enabling faster decision-making, which is especially valuable in dynamic business environments where prompt responses are essential.

However, don’t be too quick to dismiss the two-tailed hypothesis! It has its own advantages. In some business contexts, the ability to detect “negative significant results” is a major benefit. As one client once shared, he preferred negative significant results over inconclusive ones because they offer valuable learning opportunities. Even if the outcome wasn’t as expected, he could conclude that the treatment had a negative effect and gain insights into the product.

Another benefit of two-tailed tests is their straightforward interpretation using confidence intervals (CIs). In two-tailed tests, a CI that doesn’t include zero directly indicates significance, making it easier for practitioners to interpret results at a glance. This clarity is particularly appealing since CIs are widely used in A/B testing platforms. Conversely, with one-tailed tests, a significant result might still include zero in the CI, potentially leading to confusion or mistrust in the findings. Although one-sided confidence intervals can be employed with one-tailed tests, this practice is less common.

Conclusions

By adjusting a single parameter, you can significantly impact your A/B testing: specifically, the sample size you need to collect and the interpretation of the results. When deciding between one- and two-tailed hypotheses, consider factors such as the available sample size, the advantages of detecting negative effects, and the convenience of aligning confidence intervals (CIs) with hypothesis testing. Ultimately, this decision should be made thoughtfully, taking into account what best fits your business needs.

(Note: all the images in this post were created by the author)

The post One-Tailed Vs. Two-Tailed Tests appeared first on Towards Data Science.

]]>