Are You Sure Your Posterior Makes Sense?

A detailed guide on how to use diagnostics to evaluate the performance of MCMC samplers

This article is co-authored by Felipe Bandeira, Giselle Fretta, Thu Than, and Elbion Redenica. We also thank Prof. Carl Scheffler for his support.

Introduction

Parameter estimation has been one of the most important topics in statistics for decades. While frequentist approaches, such as Maximum Likelihood Estimation, used to be the gold standard, the advance of computation has opened space for Bayesian methods. Estimating posterior distributions with MCMC samplers became increasingly common, but reliable inferences depend on a task that is far from trivial: making sure that the sampler — and the processes it executes under the hood — worked as expected. Keep in mind what Lewis Carroll once wrote: “If you don’t know where you’re going, any road will take you there.”

This article is meant to help data scientists evaluate an often overlooked aspect of Bayesian parameter estimation: the reliability of the sampling process. Throughout the sections, we combine simple analogies with technical rigor to ensure our explanations are accessible to data scientists with any level of familiarity with Bayesian methods. Although our implementations are in Python with PyMC, the concepts we cover are useful to anyone using an MCMC algorithm, from Metropolis-Hastings to NUTS. 

Key Concepts

No data scientist or statistician would disagree with the importance of robust parameter estimation methods. Whether the objective is to make inferences or conduct simulations, the capacity to model the data generation process is crucial. For a long time, estimation was mainly performed using frequentist tools, such as Maximum Likelihood Estimation (MLE) or the famous Least Squares optimization used in regressions. Yet, frequentist methods have clear shortcomings, such as being focused on point estimates and not incorporating prior knowledge that could improve estimates.

As an alternative to these tools, Bayesian methods have gained popularity over the past decades. They provide statisticians not only with point estimates of the unknown parameter but also with credible intervals for it, all of which are informed by the data and by the prior knowledge researchers hold. Originally, Bayesian parameter estimation was done through an adapted version of Bayes’ theorem focused on unknown parameters (represented as θ) and known data points (represented as x). We can define P(θ|x), the posterior distribution of a parameter’s value given the data, as:

\[ P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)} \]

In this formula, P(x|θ) is the likelihood of the data given a parameter value, P(θ) is the prior distribution over the parameter, and P(x) is the evidence, which is computed by integrating the joint distribution over all possible parameter values:

\[ P(x) = \int_\theta P(x, \theta) d\theta \]

In some cases, due to the complexity of the calculations required, deriving the posterior distribution analytically is not possible. However, with the advance of computation, running sampling algorithms (especially MCMC ones) to estimate posterior distributions has become easier, giving researchers a powerful tool for situations where analytical posteriors are not trivial to find. Yet, with such power also comes a large amount of responsibility to ensure that results make sense. This is where sampler diagnostics come in, offering a set of valuable tools to gauge 1) whether an MCMC algorithm is working well and, consequently, 2) whether the estimated distribution we see is an accurate representation of the real posterior distribution. But how can we tell?

How samplers work

Before diving into the technicalities of diagnostics, we shall cover how the process of sampling a posterior (especially with an MCMC sampler) works. In simple terms, we can think of a posterior distribution as a geographical area we haven’t been to but need to know the topography of. How can we draw an accurate map of the region?  

One of our favorite analogies comes from Ben Gilbert. Suppose that the unknown region is actually a house whose floor plan we wish to map. For some reason, we cannot directly visit the house, but we can send bees inside with GPS devices attached to them. If everything works as expected, the bees will fly around the house, and using their trajectories, we can estimate what the floor plan looks like. In this analogy, the floor plan is the posterior distribution, and the sampler is the group of bees flying around the house.

The reason we are writing this article is that, in some cases, the bees won’t fly as expected. If they get stuck in a certain room for some reason (because someone dropped sugar on the floor, for example), the data they return won’t be representative of the entire house; rather than visiting all rooms, the bees only visited a few, and our picture of what the house looks like will ultimately be incomplete. Similarly, when a sampler does not work correctly, our estimation of the posterior distribution is also incomplete, and any inference we draw based on it is likely to be wrong.

Markov Chain Monte Carlo (MCMC)

In technical terms, we call an MCMC process any algorithm that undergoes transitions from one state to another with certain properties. Markov Chain refers to the fact that the next state depends only on the current one (the bee’s next location is influenced only by its current place, not by all of the places it has visited before). Monte Carlo means that the next state is chosen randomly. MCMC methods like Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), and the No-U-Turn Sampler (NUTS) all operate by constructing Markov chains (sequences of steps) that are close to random and gradually explore the posterior distribution.

Now that you understand how a sampler works, let’s dive into a practical scenario to help us explore sampling problems.

Case Study

Imagine that, in a faraway nation, a governor wants to understand more about public annual spending on healthcare by mayors of cities with fewer than 1 million inhabitants. Rather than looking at sheer frequencies, he wants to understand the underlying distribution explaining expenditure, and a sample of spending data is about to arrive. The problem is that two of the economists involved in the project disagree about how the model should look.

Model 1

The first economist believes that all cities spend similarly, with some variation around a certain mean. As such, he creates a simple model. Although the specifics of how the economist chose his priors are irrelevant to us, we do need to keep in mind that he is trying to approximate a Normal (unimodal) distribution.

\[
x_i \sim \text{Normal}(\mu, \sigma^2) \text{ i.i.d. for all } i \\
\mu \sim \text{Normal}(10, 2) \\
\sigma^2 \sim \text{Uniform}(0,5)
\]
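As a rough PyMC sketch of this model (not the economists’ actual code), assuming the spending sample arrives in a hypothetical `spending_data` array and reading the second argument of the mean’s prior as a standard deviation:

```python
import pymc as pm

with pm.Model() as model_1:
    m = pm.Normal("m", mu=10, sigma=2)           # prior on the mean
    s = pm.Uniform("s", lower=0, upper=5)        # prior on the variance
    pm.Normal("x", mu=m, sigma=pm.math.sqrt(s),  # Normal likelihood
              observed=spending_data)
    idata_1 = pm.sample(draws=2000, chains=4)    # NUTS by default
```

The parameter names m and s follow the labels used in the figures later in this article.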

Model 2

The second economist disagrees, arguing that spending is more complex than his colleague believes. He believes that, given ideological differences and budget constraints, there are two kinds of cities: the ones that do their best to spend very little and the ones that are not afraid of spending a lot. As such, he creates a slightly more complex model, using a mixture of normals to reflect his belief that the true distribution is bimodal.

\[
x_i \sim \text{Normal-Mixture}([\omega, 1-\omega], [m_1, m_2], [s_1^2, s_2^2]) \text{ i.i.d. for all } i\\
m_j \sim \text{Normal}(2.3, 0.5^2) \text{ for } j = 1,2 \\
s_j^2 \sim \text{Inverse-Gamma}(1,1) \text{ for } j=1,2 \\
\omega \sim \text{Beta}(1,1)
\]
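A corresponding sketch of Model 2, again assuming PyMC and the same hypothetical `spending_data` (parameter names follow the article’s figures):

```python
import pymc as pm

with pm.Model() as model_2:
    w = pm.Beta("w", alpha=1, beta=1)                     # mixture weight
    m = pm.Normal("m", mu=2.3, sigma=0.5, shape=2)        # component means
    s_squared = pm.InverseGamma("s_squared", alpha=1, beta=1, shape=2)
    pm.NormalMixture("x", w=pm.math.stack([w, 1 - w]), mu=m,
                     sigma=pm.math.sqrt(s_squared),       # sd, not variance
                     observed=spending_data)
    idata_2 = pm.sample(draws=2000, chains=4)
```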

After the data arrives, each economist runs an MCMC algorithm to estimate their desired posteriors, which will be a reflection of reality (1) if their assumptions are true and (2) if the sampler worked correctly. The first if, a discussion about assumptions, shall be left to the economists. However, how can they know whether the second if holds? In other words, how can they be sure that the sampler worked correctly and, as a consequence, their posterior estimations are unbiased?

Sampler Diagnostics

To evaluate a sampler’s performance, we can explore a small set of metrics that reflect different parts of the estimation process.

Quantitative Metrics

R-hat (Potential Scale Reduction Factor)

In simple terms, R-hat evaluates whether bees that started at different places have all explored the same rooms at the end of the day. To estimate the posterior, an MCMC algorithm uses multiple chains (or bees) that start at random locations. R-hat is the metric we use to assess the convergence of the chains. It measures whether multiple MCMC chains have mixed well (i.e., if they have sampled the same topography) by comparing the variance of samples within each chain to the variance of the sample means across chains. Intuitively, this means that

\[
\hat{R} = \sqrt{\frac{\text{Variance Between Chains}}{\text{Variance Within Chains}}}
\]

If R-hat is close to 1.0 (or below 1.01), it means that the variance within each chain is very similar to the variance between chains, suggesting that they have converged to the same distribution. In other words, the chains are behaving similarly and are also indistinguishable from one another. This is precisely what we see after sampling the posterior of the first model, shown in the last column of the table below:

Figure 1. Summary statistics of the sampler highlighting ideal R-hats.
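Summary tables like this one can be produced with ArviZ, the diagnostics library PyMC integrates with; a minimal sketch, assuming `idata_1` holds the first model’s trace:

```python
import arviz as az

# ess_bulk, ess_tail, and r_hat appear in the last columns of the table
az.summary(idata_1)
```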

The R-hat from the second model, however, tells a different story. The fact that we have such large R-hat values indicates that, at the end of the sampling process, the different chains had not yet converged. In practice, this means that the distribution each chain explored and returned was different, or that each bee created a map of a different room of the house. This fundamentally leaves us without a clue of how the pieces connect or what the complete floor plan looks like.

Figure 2. Summary statistics of the sampler showcasing problematic R-hats.

Given that our R-hat readouts were large, we know something went wrong with the sampling process in the second model. However, even if the R-hat had turned out within acceptable levels, this would not give us certainty that the sampling process worked. R-hat is just a diagnostic tool, not a guarantee. Sometimes, even if your R-hat readout is lower than 1.01, the sampler might not have properly explored the full posterior. This happens, for example, when multiple bees start their exploration in the same room and remain there. Likewise, if you’re using a small number of chains, and if your posterior happens to be multimodal, there is a chance that all chains started in the same mode and failed to explore other peaks.

The R-hat readout reflects convergence, not completion. In order to have a more comprehensive idea, we need to check other diagnostic metrics as well.

Effective Sample Size (ESS)

When explaining what MCMC was, we mentioned that “Monte Carlo” refers to the fact that the next state is chosen randomly. This does not necessarily mean that the states are fully independent. Even though the bees choose their next step at random, these steps are still correlated to some extent. If a bee is exploring a living room at time t=0, it will probably still be in the living room at time t=1, even though it is in a different part of the same room. Due to this natural connection between samples, we say these two data points are autocorrelated.

Due to their nature, MCMC methods inherently produce autocorrelated samples, which complicates statistical analysis and requires careful evaluation. In statistical inference, we often assume independent samples to ensure that estimates of uncertainty are accurate. If two data points are too similar to each other, the correlation reduces their effective information content. Mathematically, the formula below represents the autocorrelation function between two time points (t₁ and t₂) in a random process:

\[
R_{XX}(t_1, t_2) = E[X_{t_1} \overline{X_{t_2}}]
\]

where E is the expected value operator and the overline denotes the complex conjugate. In MCMC sampling, this matters because high autocorrelation means that new samples don’t teach us anything the old ones haven’t already, effectively reducing the sample size we have. Unsurprisingly, the metric that reflects this is called the Effective Sample Size (ESS), and it helps us determine how many truly independent samples we have.

As hinted previously, the effective sample size accounts for autocorrelation by estimating how many truly independent samples would provide the same information as the autocorrelated samples we have. Mathematically, for a parameter θ, the ESS is defined as:

\[
ESS = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho(\theta)_k}
\]

where n is the total number of samples and ρ(θ)k is the autocorrelation at lag k for parameter θ.

Typically, for ESS readouts, the higher, the better. This is what we see in the readout for the first model. Two common ESS variations are Bulk-ESS, which assesses mixing in the central part of the distribution, and Tail-ESS, which focuses on the efficiency of sampling the distribution’s tails. Both inform us if our model accurately reflects the central tendency and credible intervals.

Figure 3. Summary statistics of the sampler highlighting ideal quantities for ESS bulk and tail.
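ArviZ also exposes both variants directly; a minimal sketch, again assuming `idata_1`:

```python
import arviz as az

az.ess(idata_1, method="bulk")   # mixing in the bulk of the distribution
az.ess(idata_1, method="tail")   # efficiency when sampling the tails
```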

In contrast, the readouts for the second model are very bad. Typically, we want to see readouts that are at least 1/10 of the total sample size. In this case, given each chain sampled 2000 observations, we should expect ESS readouts of at least 800 (from the total size of 8000 samples across 4 chains of 2000 samples each), which is not what we observe.

Figure 4. Summary statistics of the sampler demonstrating problematic ESS bulk and tail.

Visual Diagnostics

Apart from the numerical metrics, our understanding of sampler performance can be deepened through the use of diagnostic plots. The main ones are rank plots, trace plots, and pair plots.

Rank Plots

A rank plot helps us identify whether the different chains have explored all of the posterior distribution. If we once again think of the bee analogy, rank plots tell us which bees explored which parts of the house. Therefore, to evaluate whether the posterior was explored equally by all chains, we observe the shape of the rank plots produced by the sampler. Ideally, we want the distribution of all chains to look roughly uniform, like in the rank plots generated after sampling the first model. Each color below represents a chain (or bee):

Figure 5. Rank plots for parameters ‘m’ and ‘s’ across four MCMC chains. Each bar represents the distribution of rank values for one chain, with ideally uniform ranks indicating good mixing and proper convergence.

Under the hood, a rank plot is produced with a simple sequence of steps. First, we run the sampler and let it sample from the posterior of each parameter. In our case, we are sampling posteriors for parameters m and s of the first model. Then, parameter by parameter, we get all samples from all chains, put them together, and order them from smallest to largest. We then ask ourselves, for each sample, what was the chain where it came from? This will allow us to create plots like the ones we see above. 
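In ArviZ, the whole sequence (pooling the chains, ranking the samples, and plotting the ranks per chain) is wrapped in a single call; a sketch, assuming `idata_1` from the first model:

```python
import arviz as az

az.plot_rank(idata_1, var_names=["m", "s"])  # one histogram of ranks per chain
```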

In contrast, bad rank plots are easy to spot. Unlike the previous example, the distributions from the second model, shown below, are not uniform. From the plots, what we interpret is that each chain, after beginning at different random locations, got stuck in a region and did not explore the entirety of the posterior. Consequently, we cannot make inferences from the results, as they are unreliable and not representative of the true posterior distribution. This would be equivalent to having four bees that started at different rooms of the house and got stuck somewhere during their exploration, never covering the entirety of the property.

Figure 6. Rank plots for parameters m, s_squared, and w across four MCMC chains. Each subplot shows the distribution of ranks by chain. There are noticeable deviations from uniformity (e.g., stair-step patterns or imbalances across chains) suggesting potential sampling issues.

KDE and Trace Plots

Similar to R-hat, trace plots help us assess the convergence of MCMC samples by visualizing how the algorithm explores the parameter space over time. PyMC provides two types of trace plots to diagnose mixing issues: Kernel Density Estimate (KDE) plots and iteration-based trace plots. Each of these serves a distinct purpose in evaluating whether the sampler has properly explored the target distribution.

The KDE plot (usually on the left) estimates the posterior density for each chain, where each line represents a separate chain. This allows us to check whether all chains have converged to the same distribution. If the KDEs overlap, it suggests that the chains are sampling from the same posterior and that mixing has occurred. On the other hand, the trace plot (usually on the right) visualizes how parameter values change over MCMC iterations (steps), with each line representing a different chain. A well-mixed sampler will produce trace plots that look noisy and random, with no clear structure or separation between chains.

Using the bee analogy, trace plots can be thought of as snapshots of the “features” of the house at different locations. If the sampler is working correctly, the KDEs in the left plot should align closely, showing that all bees (chains) have explored the house similarly. Meanwhile, the right plot should show highly variable traces that blend together, confirming that the chains are actively moving through the space rather than getting stuck in specific regions.
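Both panels come from a single ArviZ call; a sketch, again assuming `idata_1`:

```python
import arviz as az

az.plot_trace(idata_1)  # KDE per chain on the left, iteration traces on the right
```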

Figure 7. Density and trace plots for parameters m and s from the first model across four MCMC chains. The left panel shows kernel density estimates (KDE) of the marginal posterior distribution for each chain, indicating consistent central tendency and spread. The right panel displays the trace plot over iterations, with overlapping chains and no apparent divergences, suggesting good mixing and convergence.

However, if your sampler has poor mixing or convergence issues, you will see something like the figure below. In this case, the KDEs will not overlap, meaning that different chains have sampled from different distributions rather than a shared posterior. The trace plot will also show structured patterns instead of random noise, indicating that chains are stuck in different regions of the parameter space and failing to fully explore it.

Figure 8. KDE (left) and trace plots (right) for parameters m, s_squared, and w across MCMC chains for the second model. Multimodal distributions are visible for m and w, suggesting potential identifiability issues. Trace plots reveal that chains explore different modes with limited mixing, particularly for m, highlighting challenges in convergence and effective sampling.

By using trace plots alongside the other diagnostics, you can identify sampling issues and determine whether your MCMC algorithm is effectively exploring the posterior distribution.

Pair Plots

A third kind of plot that is often useful for diagnostics is the pair plot. In models where we want to estimate the posterior distribution of multiple parameters, pair plots allow us to observe how different parameters are correlated. To understand how such plots are formed, think again about the bee analogy. If you imagine that we’ll create a plot with the width and length of the house, each “step” that the bees take can be represented by an (x, y) combination. Likewise, each parameter of the posterior is represented as a dimension, and we create scatter plots showing where the sampler walked using parameter values as coordinates. Here, we are plotting each unique pair (x, y), resulting in the scatter plot you see in the middle of the image below. The one-dimensional plots you see on the edges are the marginal distributions over each parameter, giving us additional information on the sampler’s behavior when exploring them.
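A sketch of how such a plot can be drawn with ArviZ, assuming `idata_1`:

```python
import arviz as az

az.plot_pair(idata_1, marginals=True)  # pairwise scatter plus marginal densities
```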

Take a look at the pair plot from the first model.

Figure 9. Joint posterior distribution of parameters m and s, with marginal densities. The scatter plot shows a roughly symmetric, elliptical shape, suggesting a low correlation between m and s.

Each axis represents one of the two parameters whose posteriors we are estimating. For now, let’s focus on the scatter plot in the middle, which shows the parameter combinations sampled from the posterior. The fact that we have a very even distribution means that, for any particular value of m, there was a range of values of s that were equally likely to be sampled. Additionally, we don’t see any correlation between the two parameters, which is usually good! There are cases when we would expect some correlation, such as when our model involves a regression line. However, in this instance, we have no reason to believe two parameters should be highly correlated, so the fact we don’t observe unusual behavior is positive news.

Now, take a look at the pair plots from the second model.

Figure 10. Pair plot of the joint posterior distributions for parameters m, s_squared, and w. The scatter plots reveal strong correlations between several parameters.

Given that this model has five parameters to be estimated, we naturally have a greater number of plots since we are analyzing them pair-wise. However, they look odd compared to the previous example. Namely, rather than having an even distribution of points, the samples here either seem to be divided across two regions or seem somewhat correlated. This is another way of visualizing what the rank plots have shown: the sampler did not explore the full posterior distribution. Below we isolated the top left plot, which contains the samples from m0 and m1. Unlike the plot from model 1, here we see that the value of one parameter greatly influences the value of the other. If we sampled m1 around 2.5, for example, m0 is likely to be sampled from a very narrow range around 1.5.

Figure 11. Joint posterior distribution of parameters m₀ and m₁, with marginal densities.

Certain shapes can be observed in problematic pair plots relatively frequently. Diagonal patterns, for example, indicate a high correlation between parameters. Banana shapes are often connected to parametrization issues, often being present in models with tight priors or constrained parameters. Funnel shapes might indicate hierarchical models with bad geometry. When we have two separate islands, like in the plot above, this can indicate that the posterior is bimodal AND that the chains haven’t mixed well. However, keep in mind that these shapes might indicate problems, but do not necessarily do so. It’s up to the data scientist to examine the model and determine which behaviors are expected and which ones are not!

Some Fixing Techniques

When your diagnostics indicate sampling problems — whether troubling R-hat values, low ESS, unusual rank plots, separated trace plots, or strange parameter correlations in pair plots — several strategies can help you address the underlying issues. Sampling problems typically stem from the target posterior being too complex for the sampler to explore efficiently. Complex target distributions might have:

  • Multiple modes (peaks) that the sampler struggles to move between
  • Irregular shapes with narrow “corridors” connecting different regions
  • Areas of drastically different scales (like the “neck” of a funnel)
  • Heavy tails that are difficult to sample accurately

In the bee analogy, these complexities represent houses with unusual floor plans — disconnected rooms, extremely narrow hallways, or areas that change dramatically in size. Just as bees might get trapped in specific regions of such houses, MCMC chains can get stuck in certain areas of the posterior.

Figure 12. Examples of multimodal target distributions.
Figure 13. Examples of weirdly shaped distributions.

To help the sampler in its exploration, there are simple strategies we can use.

Strategy 1: Reparameterization

Reparameterization is particularly effective for hierarchical models and distributions with challenging geometries. It involves transforming your model’s parameters to make them easier to sample. Back to the bee analogy, imagine the bees are exploring a house with a peculiar layout: a spacious living room that connects to the kitchen through a very, very narrow hallway. One aspect we hadn’t mentioned before is that the bees have to fly in the same way through the entire house. That means that if we dictate the bees should use large “steps,” they will explore the living room very well but hit the walls in the hallway head-on. Likewise, if their steps are small, they will explore the narrow hallway well, but take forever to cover the entire living room. The difference in scales, which is natural to the house, makes the bees’ job more difficult.

A classic example that represents this scenario is Neal’s funnel, where the scale of one parameter depends on another:

\[
p(y, x) = \text{Normal}(y|0, 3) \times \prod_{n=1}^{9} \text{Normal}(x_n|0, e^{y/2})
\]

Figure 14. Log marginal density of y and the first dimension of Neal’s funnel. The neck is where the sampler struggles to sample from, and the step size required there is much smaller than in the body. (Image source: Stan User’s Guide)

We can see that the scale of x is dependent on the value of y. To fix this problem, we can separate x and y as independent standard Normals and then transform these variables into the desired funnel distribution. Instead of sampling directly like this:

\[
\begin{align*}
y &\sim \text{Normal}(0, 3) \\
x &\sim \text{Normal}(0, e^{y/2})
\end{align*}
\]

You can reparameterize to sample from standard Normals first:

\[
y_{raw} \sim \text{Normal}(0, 1) \\
x_{raw} \sim \text{Normal}(0, 1) \\
\\
y = 3y_{raw} \\
x = e^{y/2} x_{raw}
\]

This technique separates the hierarchical parameters and makes sampling more efficient by eliminating the dependency between them. 
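In PyMC, this non-centered version of the funnel might look like the following sketch:

```python
import pymc as pm

with pm.Model() as funnel_noncentered:
    y_raw = pm.Normal("y_raw", mu=0, sigma=1)              # standard Normal
    x_raw = pm.Normal("x_raw", mu=0, sigma=1, shape=9)     # standard Normals
    y = pm.Deterministic("y", 3 * y_raw)                   # rescale to Normal(0, 3)
    x = pm.Deterministic("x", pm.math.exp(y / 2) * x_raw)  # recover the funnel
    idata = pm.sample()
```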

Reparameterization is like redesigning the house such that instead of forcing the bees to find a single narrow hallway, we create a new layout where all passages have similar widths. This helps the bees use a consistent flying pattern throughout their exploration.

Strategy 2: Handling Heavy-tailed Distributions

Heavy-tailed distributions like the Cauchy and Student-t present challenges for samplers because no single step size works well everywhere. Their tails require larger step sizes than their central regions (similar to very long hallways that require the bees to travel long distances), which creates a dilemma:

  • Small step sizes lead to inefficient sampling in the tails
  • Large step sizes cause too many rejections in the center

Figure 15. Probability density functions for various Cauchy distributions, illustrating the effects of changing the location parameter and scale parameter. (Image source: Wikipedia)

Reparameterization solutions include:

  • For Cauchy: defining the variable as a transformation of a Uniform distribution using the Cauchy inverse CDF (sketched after this list)
  • For Student-t: using a Gamma-mixture representation
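For the Cauchy case, the inverse CDF is available in closed form, so the transformation is a one-liner; a minimal NumPy sketch, with location `x0` and scale `gamma` as assumed example values:

```python
import numpy as np

rng = np.random.default_rng(42)
u = rng.uniform(size=10_000)                      # Uniform(0, 1) draws
x0, gamma = 0.0, 1.0                              # location and scale (examples)
cauchy = x0 + gamma * np.tan(np.pi * (u - 0.5))   # Cauchy inverse CDF
```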

Strategy 3: Hyperparameter Tuning

Sometimes the solution lies in adjusting the sampler’s hyperparameters (a sketch of how these are set in PyMC follows the list):

  • Increase total iterations: The simplest approach — give the sampler more time to explore.
  • Increase target acceptance rate (adapt_delta): Reduce divergent transitions (try 0.9 instead of the default 0.8 for complex models, for example).
  • Increase max_treedepth: Allow the sampler to take more steps per iteration.
  • Extend warmup/adaptation phase: Give the sampler more time to adapt to the posterior geometry.
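
In PyMC, these knobs are exposed through `pm.sample` and the NUTS step method; a sketch, assuming the `model_2` context from the case study:

```python
import pymc as pm

with model_2:
    idata_2 = pm.sample(
        draws=4000,                       # more iterations
        tune=2000,                        # longer warmup/adaptation phase
        chains=4,
        step=pm.NUTS(target_accept=0.9,   # the adapt_delta equivalent
                     max_treedepth=12),   # more steps per iteration
    )
```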

Remember that while these adjustments may improve your diagnostic metrics, they often treat symptoms rather than underlying causes. The previous strategies (reparameterization and better proposal distributions) typically offer more fundamental solutions.

Strategy 4: Better Proposal Distributions

This strategy applies to samplers built around an explicit proposal distribution, such as Metropolis-Hastings. It basically asks the question: “I’m currently here in this landscape. Where should I jump to next so that I explore the full landscape, and how do I know the next jump is the one I should make?” Choosing a good proposal distribution means making sure that the sampling process explores the full parameter space instead of just a specific region. A good proposal distribution should:

  1. Have substantial probability mass where the target distribution does.
  2. Allow the sampler to make jumps of the appropriate size.

One common choice of proposal distribution is the Gaussian (Normal) distribution with mean μ and standard deviation σ — the scale of the distribution, which we can tune to decide how far to jump from the current position. If the scale is too small, the sampler might take too long to explore the entire posterior, or get stuck in a region and never explore the full distribution. But if the scale is too large, you might jump over some regions and never explore them. It’s like playing ping-pong where we only reach the two edges but not the middle.
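To make the trade-off concrete, here is a small self-contained sketch: random-walk Metropolis on a standard Normal target, showing how the acceptance rate saturates when the proposal scale is too small and collapses when it is too large (all values are illustrative):

```python
import numpy as np

def acceptance_rate(scale, n=20_000, seed=0):
    """Random-walk Metropolis on a standard Normal target."""
    rng = np.random.default_rng(seed)
    log_p = lambda v: -0.5 * v * v            # log-density up to a constant
    x, accepted = 0.0, 0
    for _ in range(n):
        x_new = rng.normal(x, scale)          # Gaussian proposal
        if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
            x, accepted = x_new, accepted + 1
    return accepted / n

for scale in (0.05, 1.0, 50.0):
    print(f"scale={scale:>5}: acceptance rate {acceptance_rate(scale):.2f}")
# Tiny steps are almost always accepted but barely move the chain;
# huge steps are almost always rejected.
```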

Improve Prior Specification

When all else fails, reconsider your model’s prior specifications. Vague or weakly informative priors (like uniformly distributed priors) can sometimes lead to sampling difficulties. More informative priors, when justified by domain knowledge, can help guide the sampler toward more reasonable regions of the parameter space. Sometimes, despite your best efforts, a model may remain challenging to sample effectively. In such cases, consider whether a simpler model might achieve similar inferential goals while being more computationally tractable. The best model is often not the most complex one, but the one that balances complexity with reliability. The table below shows the summary of fixing strategies for different issues.

Diagnostic Signal | Potential Issue | Recommended Fix
High R-hat | Poor mixing between chains | Increase iterations, adjust the step size
Low ESS | High autocorrelation | Reparameterization, increase adapt_delta
Non-uniform rank plots | Chains stuck in different regions | Better proposal distribution, start with multiple chains
Separated KDEs in trace plots | Chains exploring different distributions | Reparameterization
Funnel shapes in pair plots | Hierarchical model issues | Non-centered reparameterization
Disjoint clusters in pair plots | Multimodality with poor mixing | Adjusted distribution, simulated annealing

Conclusion

Assessing the quality of MCMC sampling is crucial for ensuring reliable inference. In this article, we explored key diagnostic metrics such as R-hat, ESS, rank plots, trace plots, and pair plots, discussing how each helps determine whether the sampler is performing properly.

If there’s one takeaway we want you to keep in mind, it’s that you should always run diagnostics before drawing conclusions from your samples. No single metric provides a definitive answer — each serves as a tool that highlights potential issues rather than proving convergence. When problems arise, strategies such as reparameterization, hyperparameter tuning, and prior specification can help improve sampling efficiency.

By combining these diagnostics with thoughtful modeling decisions, you can ensure a more robust analysis, reducing the risk of misleading inferences due to poor sampling behavior.

References

B. Gilbert, Bob’s bees: the importance of using multiple bees (chains) to judge MCMC convergence (2018), YouTube

Chi-Feng, MCMC demo (n.d.), GitHub

D. Simpson, Maybe it’s time to let the old ways die; or We broke R-hat so now we have to fix it (2019), Statistical Modeling, Causal Inference, and Social Science

M. Taboga, Markov Chain Monte Carlo (MCMC) methods (2021), Lectures on Probability Theory and Mathematical Statistics, Kindle Direct Publishing

T. Wiecki, MCMC Sampling for Dummies (2024), twiecki.io

Stan User’s Guide, Reparameterization (n.d.), Stan Documentation

Bayesian Inference and Markov Chain Monte Carlo Sampling in Python
An introduction to using Bayesian Inference and MCMC sampling methods to predict the distribution of unknown parameters through an in-depth coin-flip example implemented in Python.
Image from Adobe Stock

Introduction

This article extrapolates a basic coin-flip example into a larger context in which we can examine the use and power of Bayesian Inference and Markov Chain Monte Carlo sampling to predict unknown values. There are many useful packages to employ MCMC methods, but here we will build our own MCMC from scratch in Python with the goal of understanding the process at its core.

What is Bayesian Inference?

Bayesian inference is a method in which we use Bayes’ Theorem to update our understanding of a probability or a parameter as we gather more data and evidence.

What is Markov Chain Monte Carlo sampling?

The MCMC method (as it’s commonly referred to) is an algorithm used to sample from a probability distribution. This class of algorithms employs random sampling to achieve numerical results that converge on the truth as the number of samples increases. Many different algorithms exist that can be used to create this type of sampling chain – the one we will utilize for this specific example is called the Metropolis-Hastings (MH) algorithm. The MH implementation allows us to draw samples from an unknown distribution provided we know a function that is proportional to its probability density. Not needing the exact probability density, only something proportional to it, is what makes MCMC-MH especially useful. We will describe the detailed steps of the algorithm later in this example.

The Coin Factory Problem

This example will model the probability of the outcome of a coin flip but in an expanded context. Imagine that a brand new coin factory has just been constructed to manufacture a new kind of coin, and they have tasked you with determining a parameter of the coins the factory produces – the probability they land on heads/tails when flipped. We will call this the coin’s ‘bias’. Each coin will have its own bias, but given they are produced in the same process and vary only slightly, we expect the factory to produce coins not randomly, but around a ‘factory bias’. We will use both information from the specific coin and the factory bias to create a probability distribution to determine the bias a coin is most likely produced with. In Bayes’ terms, we will use the likelihood (coin data) and the prior (factory bias) to produce the posterior distribution for the coin bias.

Image from Adobe Stock

So, the factory opens up and produces its very first coin. Given it is a new kind of coin and mint, we don’t know much about what to expect for the way it behaves when flipped (pretend we don’t know how coins generally work, we are trying to predict an ‘unknown’ parameter). So, we run an experiment with this first coin by flipping it 100 times and recording the number of times it landed on heads – 57 times. We will do this with more coins as they are produced in order to update our understanding of the factory bias, but first, let us analyze just the first coin.

The Problem in a Bayesian Context

Before we get too far, we need to define the problem in the context of Bayes’ Theorem. The left side of the equation is called the posterior; generally, it is the probability of a hypothesis (H) given some evidence (E). In the numerator on the right side, we have our likelihood (the probability of seeing the evidence given that our hypothesis is true), multiplied by the prior (the probability of the hypothesis). The denominator is the marginal likelihood, or evidence: the probability of observing the evidence at all. Fortunately, we will not need to use the marginal likelihood to sample the posterior.
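In symbols:

\[
P(H|E) = \frac{P(E|H)\, P(H)}{P(E)}
\]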

Bayes’ Theorem – image by author.

In most practical applications, computing the posterior distribution directly is not possible; this is why we employ numerical sampling techniques like MCMC.

Posterior

So what is the posterior probability we are interested in for our coin factory? It is the probability, P, that the factory produces a coin with bias, p, given our data, x number of heads – P(p|x). The rest of the theorem can be written as follows:
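\[
P(p|x) = \frac{P(x|p)\, P(p)}{P(x)}
\]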

Bayes’ Rule in the context of our coin factory – image by author.

It is important to remember here that we are predicting the posterior probability distribution P(p|x) that a coin has a certain probability of heads, or bias, p; these are two different things. The factory bias is the probability distribution of a coin being produced with a certain bias; this is P(p), the prior.

Likelihood – The Binomial Distribution

The likelihood function here is the probability of observing x heads given a coin with bias p. For a coin toss, this function is described precisely by the Binomial Distribution. For a coin with bias p, the probability of observing x heads out of n flips can be written:
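\[
P(x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x}
\]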

Binomial Distribution – image by author

It is important to note that we do not currently know the value p for a given coin. This value is the one our MCMC will randomly sample. After initializing p with a random value, and through many samples, the model will converge towards the true value of p.

Let’s begin coding this example in Python. We need to define the observed data from our first coin experiment, and a likelihood function that gives a probability from a binomial distribution given our data. scipy.stats is a great Python library for easily defining distributions like the binomial, and it is what I use for this example.
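A minimal sketch of that setup, using the 57-heads-out-of-100 experiment:

```python
from scipy.stats import binom

n_flips, heads = 100, 57   # observed data from the first coin

def likelihood(p):
    """P(x | p): probability of observing our heads count for bias p."""
    return binom.pmf(heads, n_flips, p)
```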

Prior

Next, we need to define our prior function, P(p). The value for p can only exist between 0 and 1, with 0 representing a coin that will never land heads, and 1 representing a coin that will always land heads. Remember, the factory has only produced one coin, so we don’t have any information on what to expect the prior probability of p to be (pretend we don’t know how coins work). In the context of the problem, with only one coin produced, we have no idea of what the factory bias may yet be. Because of this, we will use what is called a uniform prior. In the Bayesian Inference context, this says that we assign equal weight to the probability of p being any value between 0 and 1.

We can do this easily with Scipy and the uniform distribution PDF. This default distribution exists only from 0 to 1, just as we need.
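A sketch of the uniform prior:

```python
from scipy.stats import uniform

def prior(p):
    """P(p): uniform over [0, 1], every bias equally plausible a priori."""
    return uniform.pdf(p)   # scipy's default interval is [0, 1]
```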

Now that we have our likelihood and prior defined, we can move on to understanding and coding the Markov Chain.

The Metropolis-Hastings MCMC

As mentioned above, these methods draw samples from a continuous random variable – in our case for p. The MCMC we will use is of the random walk type, which randomly generates samples and keeps them or not based on how they fit the model.

Acceptance Ratio

The Metropolis-Hastings algorithm is fairly straightforward, but first we need to define how we will accept or reject the new sample draw. At each iteration, a new value for p between 0 and 1 is proposed; we will call this proposed value p′. We accept and update this value with a probability that reflects how much better it is than the previous one. This is done by computing an acceptance ratio, R, which is the ratio of our Bayes’ Theorem for the proposed value over the previous value, shown below.
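\[
R = \frac{P(x|p')\, P(p')}{P(x|p)\, P(p)}
\]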

Acceptance Ratio, R – image by author

There are a few things to note here. First, you may have noticed that this acceptance ratio does not include the marginal likelihood (evidence) part of Bayes’ Theorem, nor did we define a function for it above. This is because the evidence does not change for a new value of p, and thus cancels itself out in this ratio. This is awesome, as computing the marginal likelihood part of Bayes’ Theorem is usually extremely difficult or impossible in practice. MCMC and Bayesian Inference allow us to sample the posterior without needing to know the marginal likelihood!

Second, any value greater than 1 here means that the proposed value is better and should be accepted. The second part of accepting new values is comparing R to another random draw between 0 and 1, so it is convention to just set R to 1 when it is higher.

For simplicity, we will write a function that computes this acceptance ratio to easily implement in our sampling chain loop.
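A sketch of that function, built on the `likelihood` and `prior` defined earlier:

```python
def acceptance_ratio(p_current, p_new):
    """R = [P(x|p') P(p')] / [P(x|p) P(p)], capped at 1 by convention."""
    ratio = (likelihood(p_new) * prior(p_new)) / \
            (likelihood(p_current) * prior(p_current))
    return min(1.0, ratio)
```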

Let’s explicitly define the steps of the MH algorithm so this is more clear.

The Sampling Algorithm

We have already defined our functions for the likelihood, prior, and acceptance probability. The last thing we must do before looping is to initialize p with a random value within its range, 0 to 1.

Here are the steps of our Metropolis-Hastings algorithm:

  1. Propose a new value of p randomly between 0 and 1, call it p′ (or p_new).
  2. Compute the acceptance ratio, R.
  3. Generate another uniform random number between 0 and 1, call it u.
  4. If u < R, accept the new value and set p = p′. Otherwise, keep the current value of p.
  5. Record the final value of p for this sample.
  6. Repeat steps 1 through 5 many, many times.

Pretty straightforward. Before we code this up, let’s explore a couple of other common concepts utilized in MCMC.

Burn-in

MCMCs are initialized randomly and must converge towards the correct value, and this can often take quite a lot of samples. When plotting our results and posterior distribution, it is not effective to include these early samples before the model has converged. So we implement what is called a "burn-in", in which those first, less accurate samples are excluded. Burn-in for MCMCs is typically around 2000–5000 samples when the entire chain is around 10k–20k.

Lag

Another very important thing to consider with MCMCs is sample independence. A new sample here is often dependent on the previous one as occasionally we do not accept a new random value and keep the old. To address this problem, we implement what is called "lag". Lag is where rather than recording every sample, we record every other, or perhaps every fifth or tenth sample.

The Simulation

Great, we now have everything we need to write and run our MCMC.
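A sketch of the full loop, putting the pieces together with burn-in and lag (the sample counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, burn_in, lag = 20_000, 5_000, 5

p = rng.uniform()                    # initialize p randomly in (0, 1)
chain = []

for _ in range(n_samples):
    p_new = rng.uniform()            # step 1: propose p'
    R = acceptance_ratio(p, p_new)   # step 2: acceptance ratio
    if rng.uniform() < R:            # steps 3-4: accept if u < R
        p = p_new
    chain.append(p)                  # step 5: record the sample

# Apply burn-in and lag before analyzing the posterior.
posterior = np.array(chain[burn_in::lag])
print(posterior.mean())              # ~0.57 for the 57/100 data
```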

Visualizing the results

MCMC results are typically plotted in two ways – the posterior distribution and the traceplot. The posterior can be shown in a histogram in which we can examine the most likely value and the variance visually. A traceplot shows the value of p for each sample iteration and shows the behavior and convergence of the MCMC. Posterior distributions should never include burn-in samples, but including burn-in for traceplots can help us to examine where the model started and how well it converged. Burn-in samples are excluded from the traceplot below.

Posterior Distribution and Traceplot – image by author.

Examining the results, we can see that our posterior is normally distributed. We can also see from the traceplot that the chain sampled randomly around the convergence value – this is good. When extracting a value for the posterior, the last sampled value of p is not always representative of the distribution as a whole. For this reason, the posterior value is usually taken as the mean of the distribution. In this case, our mean posterior value is 0.569.

This value is nearly what we would have predicted if we took the frequency probability of our data for this coin, 57/100 heads. The reason our MCMC predicts this is that it is the first coin, and we don’t have much knowledge about how coins should behave. With a uniform prior, the posterior is more influenced by the likelihood function, i.e., the data. The next step in Bayesian Inference is to update our understanding with more data, so let’s keep the factory running and make more coins to test.

Photo by GSJJ on Unsplash

Updating Our Understanding

We wait a while, and the factory produces 500 coins. We run the same 100 flip experiments and record the posterior’s most likely bias for each coin. Let’s plot a histogram and examine the spread of biases produced to learn about the factory bias.

500 Coin Factory Bias – image by author.

Examining this data, we learn that the mean bias is 0.511 and the standard deviation is 0.049. The data appears normal, so let’s model it with a normal distribution with these parameters – this is shown in red.

This distribution contains information about which biases we expect to be more likely for coins produced at this factory. Updating our understanding like this will give us a much more accurate result for the bias of a coin than our previous, uninformed, uniform prior. This is exactly what Bayesian Inference is built for – we can simply update our prior function to represent the data of the factory bias.

Now that we have improved our model with more prior information, let’s produce the 501st coin and run the same experiment. In this experiment, we get 63 heads out of 100 tosses. We must build our likelihood function off this data as we did before:
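A sketch of both updated functions: the prior now encodes the factory bias, and the likelihood uses the 501st coin’s data:

```python
from scipy.stats import binom, norm

def prior(p):
    """Informed prior: the factory bias, Normal(0.511, 0.049)."""
    return norm.pdf(p, loc=0.511, scale=0.049)

def likelihood(p):
    """New data: 63 heads out of 100 tosses for the 501st coin."""
    return binom.pmf(63, 100, p)
```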

Great, now let’s see how including information about the factory bias affects the result of the MCMC and what we think the true bias of this coin to be. I have run an MCMC with the same number of samples and parameters as above, but with the updated prior and new coin data. Below shows the posterior distribution and traceplot for the 501st coin.

Posterior and Trace from MCMC with updated Prior – image by author.

Even though the data for this coin suggests a bias around 0.63, our Bayesian Inference model suggests the actual value is much closer to 0.53. This is because our informed prior function holds weight in the model, telling us that even if we observed 63 heads for this coin, given that coins are produced with an average bias of around 0.51, we expect the bias for the 501st to be closer to the factory bias. Even if this coin had exactly a 0.5 bias, observing 63 heads out of 100 tosses is not entirely unlikely, and we shouldn’t assume this data to be representative of the exact value. The prior function holds weight just like the likelihood does in informing the posterior distribution. If we were to produce thousands more coins and inform the prior distribution even more, this would give it an even higher weight in the model. This idea of updating our understanding with more information to predict unknown parameters is exactly why Bayesian Inference is useful. It is tuning and manipulating these likelihood and prior functions with more and better data that allows us to improve and inform our inference model.

Conclusion

When implementing Bayesian Inference and MCMC, one does not usually expect to know what the posterior distribution will look like, as we do here because we made up the data. This is a powerful tool for the inference of unknown parameters – one only needs to know what the posterior distribution is proportional to (e.g., a linear model with unknown slope and intercept, or a logistic model with parameters alpha and beta). It allows us to do things like estimate high-dimensional integrals we cannot solve analytically – believe it or not, you can do it with just randomness and statistics!

References

Besides the Wikipedia links given above, this article was very helpful in connecting the contexts of Bayesian Inference and MCMC – Introduction to MCMC using RevBayes.

Moving forward, I recommend utilizing packages like PyMC3 to do Bayesian Inference rather than coding from scratch. This article by Will Koehrsen provides an awesome real-world example and is worth checking out:

Markov Chain Monte Carlo in Python

If you wish to dive deeper into the math and reasoning that makes Bayesian Inference and MCMC possible, I highly recommend this article – Bayesian Inference Problem, MCMC and Variational Inference.


If you’re looking for more information on my implementation or would like to give feedback (always appreciated), please shoot me a message or email at jonny.hofmeister@gmail.com.

Thank you!

How to generate a vector of random numbers on a GPU

Modern hardware for an old problem
Photo by Timothy Dykes on Unsplash

Random numbers are an essential part of the life of a data scientist. You must generate them, for instance, if you need matrices to initialise neural networks, to perform data analysis using null models, or when running Monte Carlo simulations.

Imagine you have to generate many independent sequences of random numbers – say, for instance, a set of Markov Chain Monte Carlo (MCMC) chains. In this kind of problem, given a starting point and a rule for generating state n+1 from state n, it is possible to sample many states of a given distribution. It is also possible to use MCMC to perform Natural Language Processing [1].

The ideal hardware for performing the same task on multiple independent objects is a Graphics Processing Unit (GPU). GPUs can, indeed, be programmed to run multiple independent threads that perform the same operation concurrently.

Here I will describe a brief example of how to generate sequences of random numbers using the Apple Metal Framework to run operations on fast GPU hardware.

A vector of random numbers

The idea here is to generate a vector of independent, uniformly distributed random numbers. Each number could represent, for instance, the state at the end of a Markov Chain.

The code is composed of three parts:

  • a vector of initial states (seeds) is prepared
  • the vector is passed to the GPU and each thread run a chain and output a single number (state)
  • the states are copied from GPU to CPU and then to a file on the disk to be analysed later on.

Running on GPU

I chose to take advantage of the Apple Metal Framework to perform the operations described above on the GPU. The whole code can be easily translated to other languages such as OpenCL or CUDA.

Metal – Apple Developer

To execute the pipeline against our data on Metal, we need to create a command buffer, write commands into it, and commit the buffer to a command queue. Metal will send the commands to the GPU to be executed.

In this example we will use a compute command buffer (ComputeCommandBuffer), since our task is computational and not graphical.

Our program, written in C++ and wrapping the Metal code [2], will run efficiently on the GPU built into the Mac.

GPU usage statistic during computation. Image by Author.

Algorithms

The generation of (pseudo) random numbers starts with an integer value called a seed. From there, different approaches exist to generate the subsequent numbers; the Mersenne Twister and xorshift are two examples.

Mersenne Twister algorithm

The Mersenne Twister is by far the most widely used general-purpose pseudorandom number generator. It was developed in 1997 by Makoto Matsumoto and Takuji Nishimura and was designed specifically to rectify most of the flaws found in older generators.

The most commonly used version of the Mersenne Twister algorithm is based on the Mersenne prime 2¹⁹⁹³⁷-1. The standard implementation of that, MT19937, uses a 32-bit word length.

In this example I will use a Metal implementation of Mersenne Twister.

Marsaglia’s xorshift

Another easy-to-implement algorithm for generating random numbers is xorshift [3]. It is a very fast type of generator. Its creator also suggested, as an improvement, the xorwow generator, in which the output of a xorshift generator is added to a Weyl sequence. The xorwow generator is the default generator in the cuRAND library of the Nvidia CUDA application programming interface for graphics processing units.
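For illustration, here is a plain Python sketch of the xorwow recurrence as it is commonly presented (a xorshift core over five 32-bit words plus a Weyl counter); the constants follow the usually cited version of Marsaglia's generator, and the explicit masking emulates 32-bit unsigned overflow:

```python
M32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic

def xorwow(state):
    """One xorwow step; `state` is [x0, x1, x2, x3, x4, counter].
    The five state words must not all be zero."""
    t, s = state[4], state[0]
    state[4], state[3], state[2], state[1] = state[3], state[2], state[1], s
    t ^= t >> 2
    t = (t ^ (t << 1)) & M32
    t = (t ^ s ^ ((s << 4) & M32)) & M32
    state[0] = t
    state[5] = (state[5] + 362437) & M32      # Weyl sequence step
    return (t + state[5]) & M32
```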

A GPU implementation of xorshift is the following.

// Marsaglia's 32-bit xorshift: three shift-XOR steps scramble the state.
uint rand_xorshift(uint rng_state)
{
    rng_state ^= (rng_state << 13);
    rng_state ^= (rng_state >> 17);
    rng_state ^= (rng_state << 5);
    return rng_state;
}

This can easily be ported to other GPU programming languages such as OpenCL.

Examples and runs

In the example I will describe, the threads generate random numbers drawn from a uniform distribution with bounds 0 and 1000.

The Metal code below shows an example of generating 100 random numbers on each GPU thread and storing the last one in an array (vOut), which will later be accessed by the CPU code.

Each thread has its own seed, passed in by the main CPU thread through the vIn array. Then each thread creates its own generator:

#include <metal_stdlib>
#include "mersenne.metal"
using namespace metal;

kernel void storyMersenne(
    const device uint *vIn [[ buffer(0) ]],
    device int *vOut       [[ buffer(1) ]],
    uint id [[ thread_position_in_grid ]])
{
    // Seed a per-thread Mersenne Twister from the CPU-provided vector.
    mt19937 generator;
    generator(vIn[id]);

    // Advance the generator 100 times and keep the last draw.
    uint r;
    for (uint l = 0; l < 100; l++) r = generator.rand();

    vOut[id] = r;
}

When each thread of the GPU has generated its random numbers, it is possible to look at their distribution and verify whether they are uniformly distributed as expected.

The figure shows that, yes, they are indeed uniformly distributed.

What if there are multiple runs

Another criterion one can use to evaluate a random number generator is that there should be a high probability that independently generated sequences of random numbers differ from each other.

I tested this by generating multiple sequences of random numbers and studying the correlations between them. The figure shows that points generated in different runs are effectively uncorrelated.

Correlation of different runs. Image by Author.

Conclusions

There are many applications in which having a good source of random numbers is necessary. Generating them fast would be an advantage.

When you have to run computations or Markov Chain Monte Carlo involving sequences of random numbers, as long as the sequences are independent, and knowing that GPU hardware is optimised to perform many operations in parallel, generating them with algorithms implemented in Metal or OpenCL can be a good option.

A simple example shows that it is possible to generate uniformly distributed and uncorrelated random numbers easily, taking advantage of a MacBook GPU.

References

[1] M. Gerlach, T.P. Peixoto, E.G. Altmann, A network approach to topic models (2018)

[2] Jason R. Blevins, C++ Mersenne Twister wrapper class (2006)

[3] G. Marsaglia, Random number generators (2003), Journal of Modern Applied Statistical Methods
