Are You Sure Your Posterior Makes Sense?

A detailed guide on how to use diagnostics to evaluate the performance of MCMC samplers

This article is co-authored by Felipe Bandeira, Giselle Fretta, Thu Than, and Elbion Redenica. We also thank Prof. Carl Scheffler for his support.

Introduction

Parameter estimation has been one of the most important topics in statistics for decades. While frequentist approaches, such as Maximum Likelihood Estimation, used to be the gold standard, the advance of computation has opened space for Bayesian methods. Estimating posterior distributions with MCMC samplers became increasingly common, but reliable inferences depend on a task that is far from trivial: making sure that the sampler — and the processes it executes under the hood — worked as expected. Keep in mind what Lewis Carroll once wrote: “If you don’t know where you’re going, any road will take you there.”

This article is meant to help data scientists evaluate an often overlooked aspect of Bayesian parameter estimation: the reliability of the sampling process. Throughout the sections, we combine simple analogies with technical rigor to ensure our explanations are accessible to data scientists with any level of familiarity with Bayesian methods. Although our implementations are in Python with PyMC, the concepts we cover are useful to anyone using an MCMC algorithm, from Metropolis-Hastings to NUTS. 

Key Concepts

No data scientist or statistician would disagree with the importance of robust parameter estimation methods. Whether the objective is to make inferences or conduct simulations, the capacity to model the data generation process is a crucial part of the work. For a long time, estimation was mainly performed using frequentist tools, such as Maximum Likelihood Estimation (MLE) or the famous Least Squares optimization used in regressions. Yet, frequentist methods have clear shortcomings, such as the fact that they are focused on point estimates and do not incorporate prior knowledge that could improve estimates.

As an alternative to these tools, Bayesian methods have gained popularity over the past decades. They provide statisticians not only with point estimates of the unknown parameter but also with credible intervals for it, all of which are informed by the data and by the prior knowledge researchers hold. Originally, Bayesian parameter estimation was done through an adapted version of Bayes’ theorem focused on unknown parameters (represented as θ) and known data points (represented as x). We can define P(θ|x), the posterior distribution of a parameter’s value given the data, as:

\[ P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)} \]

In this formula, P(x|θ) is the likelihood of the data given a parameter value, P(θ) is the prior distribution over the parameter, and P(x) is the evidence, which is computed by integrating the joint distribution over all possible parameter values:

\[ P(x) = \int_\theta P(x, \theta) d\theta \]

In some cases, due to the complexity of the calculations required, deriving the posterior distribution analytically is not possible. However, with the advance of computation, running sampling algorithms (especially MCMC ones) to estimate posterior distributions has become easier, giving researchers a powerful tool for situations where analytical posteriors are not trivial to find. Yet, with such power also comes a large amount of responsibility to ensure that results make sense. This is where sampler diagnostics come in, offering a set of valuable tools to gauge 1) whether an MCMC algorithm is working well and, consequently, 2) whether the estimated distribution we see is an accurate representation of the real posterior distribution. But how can we know?

How samplers work

Before diving into the technicalities of diagnostics, we shall cover how the process of sampling a posterior (especially with an MCMC sampler) works. In simple terms, we can think of a posterior distribution as a geographical area we haven’t been to but need to know the topography of. How can we draw an accurate map of the region?  

One of our favorite analogies comes from Ben Gilbert. Suppose that the unknown region is actually a house whose floorplan we wish to map. For some reason, we cannot directly visit the house, but we can send bees inside with GPS devices attached to them. If everything works as expected, the bees will fly around the house, and using their trajectories, we can estimate what the floor plan looks like. In this analogy, the floor plan is the posterior distribution, and the sampler is the group of bees flying around the house.

The reason we are writing this article is that, in some cases, the bees won’t fly as expected. If they get stuck in a certain room for some reason (because someone dropped sugar on the floor, for example), the data they return won’t be representative of the entire house; rather than visiting all rooms, the bees only visited a few, and our picture of what the house looks like will ultimately be incomplete. Similarly, when a sampler does not work correctly, our estimation of the posterior distribution is also incomplete, and any inference we draw based on it is likely to be wrong.

Markov Chain Monte Carlo (MCMC)

In technical terms, we call an MCMC process any algorithm that undergoes transitions from one state to another with certain properties. Markov Chain refers to the fact that the next state only depends on the current one (or that the bee’s next location is only influenced by its current place, and not by all of the places where it has been before). Monte Carlo means that the next state is chosen randomly. MCMC methods like Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), and No-U-Turn Sampler (NUTS) all operate by constructing Markov Chains (a sequence of steps) that are close to random and gradually explore the posterior distribution.
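To make the idea concrete, below is a toy random-walk Metropolis sampler for a one-dimensional target. It is only a sketch for building intuition (the target density and all settings are illustrative), not the algorithm PyMC uses by default, but it shows the two ingredients above: a random proposal and a transition that depends only on the current state.

import numpy as np

def log_posterior(theta):
    # Toy target: a standard Normal log-density (up to an additive constant)
    return -0.5 * theta ** 2

def metropolis(n_samples=5000, step_size=1.0, seed=42):
    """Random-walk Metropolis: the next state depends only on the current one
    (Markov), and the proposed jump is drawn at random (Monte Carlo)."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current = 0.0
    for i in range(n_samples):
        proposal = current + rng.normal(0.0, step_size)  # random jump
        # Accept with probability min(1, p(proposal) / p(current))
        if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(current):
            current = proposal
        samples[i] = current
    return samples

draws = metropolis()
print(draws.mean(), draws.std())  # should land close to 0 and 1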

Now that you understand how a sampler works, let’s dive into a practical scenario to help us explore sampling problems.

Case Study

Imagine that, in a faraway nation, a governor wants to understand more about public annual spending on healthcare by mayors of cities with less than 1 million inhabitants. Rather than looking at sheer frequencies, he wants to understand the underlying distribution explaining expenditure, and a sample of spending data is about to arrive. The problem is that two of the economists involved in the project disagree about how the model should look.

Model 1

The first economist believes that all cities spend similarly, with some variation around a certain mean. As such, he creates a simple model. Although the specifics of how the economist chose his priors are irrelevant to us, we do need to keep in mind that he is trying to approximate a Normal (unimodal) distribution.

\[
x_i \sim \text{Normal}(\mu, \sigma^2) \text{ i.i.d. for all } i \\
\mu \sim \text{Normal}(10, 2) \\
\sigma^2 \sim \text{Uniform}(0,5)
\]
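A minimal PyMC sketch of this model could look like the snippet below. The names m and s mirror the parameter names that appear in the figures later on; the array spending is a hypothetical placeholder for the sample the economists receive; and treating the second argument of the Normal prior as a standard deviation is our assumption, since the notation above does not say.

import pymc as pm

with pm.Model() as model_1:
    m = pm.Normal("m", mu=10, sigma=2)      # prior on the mean
    s = pm.Uniform("s", lower=0, upper=5)   # prior on the variance
    # `spending` is a hypothetical NumPy array holding the observed data
    pm.Normal("x", mu=m, sigma=pm.math.sqrt(s), observed=spending)
    idata_1 = pm.sample(draws=2000, chains=4, random_seed=42)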

Model 2

The second economist disagrees, arguing that spending is more complex than his colleague believes. He believes that, given ideological differences and budget constraints, there are two kinds of cities: the ones that do their best to spend very little and the ones that are not afraid of spending a lot. As such, he creates a slightly more complex model, using a mixture of normals to reflect his belief that the true distribution is bimodal.

\[
x_i \sim \text{Normal-Mixture}([\omega, 1-\omega], [m_1, m_2], [s_1^2, s_2^2]) \text{ i.i.d. for all } i\\
m_j \sim \text{Normal}(2.3, 0.5^2) \text{ for } j = 1,2 \\
s_j^2 \sim \text{Inverse-Gamma}(1,1) \text{ for } j=1,2 \\
\omega \sim \text{Beta}(1,1)
\]
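A matching PyMC sketch for the mixture model follows, under the same assumptions as before (spending is a hypothetical data array; the names m, s_squared, and w mirror the figures shown later).

import pymc as pm

with pm.Model() as model_2:
    w = pm.Beta("w", alpha=1, beta=1)                                   # mixture weight
    m = pm.Normal("m", mu=2.3, sigma=0.5, shape=2)                      # component means
    s_squared = pm.InverseGamma("s_squared", alpha=1, beta=1, shape=2)  # component variances
    pm.NormalMixture(
        "x",
        w=pm.math.stack([w, 1 - w]),
        mu=m,
        sigma=pm.math.sqrt(s_squared),
        observed=spending,  # hypothetical data array, as above
    )
    idata_2 = pm.sample(draws=2000, chains=4, random_seed=42)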

After the data arrives, each economist runs an MCMC algorithm to estimate their desired posteriors, which will be a reflection of reality (1) if their assumptions are true and (2) if the sampler worked correctly. The first if, a discussion about assumptions, shall be left to the economists. However, how can they know whether the second if holds? In other words, how can they be sure that the sampler worked correctly and, as a consequence, their posterior estimations are unbiased?

Sampler Diagnostics

To evaluate a sampler’s performance, we can explore a small set of metrics that reflect different parts of the estimation process.

Quantitative Metrics

R-hat (Potential Scale Reduction Factor)

In simple terms, R-hat evaluates whether bees that started at different places have all explored the same rooms at the end of the day. To estimate the posterior, an MCMC algorithm uses multiple chains (or bees) that start at random locations. R-hat is the metric we use to assess the convergence of the chains. It measures whether multiple MCMC chains have mixed well (i.e., if they have sampled the same topography) by comparing the variance of samples within each chain to the variance of the sample means across chains. Intuitively, this means that

\[
\hat{R} = \sqrt{\frac{\text{Variance Between Chains}}{\text{Variance Within Chains}}}
\]

If R-hat is close to 1.0 (or below 1.01), it means that the variance within each chain is very similar to the variance between chains, suggesting that they have converged to the same distribution. In other words, the chains are behaving similarly and are also indistinguishable from one another. This is precisely what we see after sampling the posterior of the first model, shown in the last column of the table below:

Figure 1. Summary statistics of the sampler highlighting ideal R-hats.
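Summary tables like the one above can be produced with ArviZ, the library that PyMC hands its results to as InferenceData objects. A sketch, assuming the idata_1 object from the earlier sampling code:

import arviz as az

summary = az.summary(idata_1, var_names=["m", "s"])
print(summary[["mean", "sd", "ess_bulk", "ess_tail", "r_hat"]])

# R-hat on its own, per parameter
print(az.rhat(idata_1))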

The R-hat values from the second model, however, tell a different story. The fact that we have such large R-hat values indicates that, at the end of the sampling process, the different chains had not converged yet. In practice, this means that the distributions they explored and returned were different, or that each bee created a map of a different room of the house. This fundamentally leaves us without a clue of how the pieces connect or what the complete floor plan looks like.

Figure 2. Summary statistics of the sampler showcasing problematic R-hats.

Given our R-hat readouts were large, we know something went wrong with the sampling process in the second model. However, even if the R-hat had turned out within acceptable levels, this does not give us certainty that the sampling process worked. R-hat is just a diagnostic tool, not a guarantee. Sometimes, even if your R-hat readout is lower than 1.01, the sampler might not have properly explored the full posterior. This happens when multiple bees start their exploration in the same room and remain there. Likewise, if you’re using a small number of chains, and if your posterior happens to be multimodal, there is a probability that all chains started in the same mode and failed to explore other peaks. 

The R-hat readout reflects convergence, not completion. In order to have a more comprehensive idea, we need to check other diagnostic metrics as well.

Effective Sample Size (ESS)

When explaining what MCMC was, we mentioned that “Monte Carlo” refers to the fact that the next state is chosen randomly. This does not necessarily mean that the states are fully independent. Even though the bees choose their next step at random, these steps are still correlated to some extent. If a bee is exploring a living room at time t=0, it will probably still be in the living room at time t=1, even though it is in a different part of the same room. Due to this natural connection between samples, we say these two data points are autocorrelated.

Due to their nature, MCMC methods inherently produce autocorrelated samples, which complicates statistical analysis and requires careful evaluation. In statistical inference, we often assume independent samples to ensure that the estimates of uncertainty are accurate, hence the need for uncorrelated samples. If two data points are too similar to each other, the correlation reduces their effective information content. Mathematically, the formula below represents the autocorrelation function between two time points (t1 and t2) in a random process:

\[
R_{XX}(t_1, t_2) = E[X_{t_1} \overline{X_{t_2}}]
\]

where E is the expected value operator and X-bar is the complex conjugate. In MCMC sampling, this is crucial because high autocorrelation means that new samples don’t teach us anything different from the old ones, effectively reducing the sample size we have. Unsurprisingly, the metric that reflects this is called Effective Sample Size (ESS), and it helps us determine how many truly independent samples we have. 

As hinted previously, the effective sample size accounts for autocorrelation by estimating how many truly independent samples would provide the same information as the autocorrelated samples we have. Mathematically, for a parameter θ, the ESS is defined as:

\[
ESS = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho(\theta)_k}
\]

where n is the total number of samples and ρ(θ)k is the autocorrelation at lag k for parameter θ.

Typically, for ESS readouts, the higher, the better. This is what we see in the readout for the first model. Two common ESS variations are Bulk-ESS, which assesses mixing in the central part of the distribution, and Tail-ESS, which focuses on the efficiency of sampling the distribution’s tails. Both inform us if our model accurately reflects the central tendency and credible intervals.

Figure 3. Summary statistics of the sampler highlighting ideal quantities for ESS bulk and tail.

In contrast, the readouts for the second model are very bad. Typically, we want to see readouts that are at least 1/10 of the total sample size. In this case, given each chain sampled 2000 observations, we should expect ESS readouts of at least 800 (from the total size of 8000 samples across 4 chains of 2000 samples each), which is not what we observe.

Figure 4. Summary statistics of the sampler demonstrating problematic ESS bulk and tail.
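ESS readouts are part of az.summary, but they can also be computed directly. A sketch, reusing the idata_1 and idata_2 objects assumed earlier:

import arviz as az

# Bulk and tail ESS for the first model
print(az.ess(idata_1, method="bulk"))
print(az.ess(idata_1, method="tail"))

# Rule of thumb from the text: with 4 chains of 2,000 draws each,
# we would like every readout to be at least 8,000 / 10 = 800
print(az.summary(idata_2)[["ess_bulk", "ess_tail"]])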

Visual Diagnostics

Apart from the numerical metrics, our understanding of sampler performance can be deepened through the use of diagnostic plots. The main ones are rank plots, trace plots, and pair plots.

Rank Plots

A rank plot helps us identify whether the different chains have explored all of the posterior distribution. If we once again think of the bee analogy, rank plots tell us which bees explored which parts of the house. Therefore, to evaluate whether the posterior was explored equally by all chains, we observe the shape of the rank plots produced by the sampler. Ideally, we want the distribution of all chains to look roughly uniform, like in the rank plots generated after sampling the first model. Each color below represents a chain (or bee):

Figure 5. Rank plots for parameters ‘m’ and ‘s’ across four MCMC chains. Each bar represents the distribution of rank values for one chain, with ideally uniform ranks indicating good mixing and proper convergence.

Under the hood, a rank plot is produced with a simple sequence of steps. First, we run the sampler and let it sample from the posterior of each parameter. In our case, we are sampling posteriors for parameters m and s of the first model. Then, parameter by parameter, we take all samples from all chains, put them together, and order them from smallest to largest. We then ask ourselves, for each sample, which chain it came from. This allows us to create plots like the ones we see above.
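ArviZ wraps this whole procedure in a single call. A sketch, assuming the idata objects from the earlier sampling code:

import arviz as az
import matplotlib.pyplot as plt

az.plot_rank(idata_1, var_names=["m", "s"])  # should look roughly uniform
az.plot_rank(idata_2)                        # the problematic mixture model
plt.show()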

In contrast, bad rank plots are easy to spot. Unlike the previous example, the distributions from the second model, shown below, are not uniform. From the plots, what we interpret is that each chain, after beginning at different random locations, got stuck in a region and did not explore the entirety of the posterior. Consequently, we cannot make inferences from the results, as they are unreliable and not representative of the true posterior distribution. This would be equivalent to having four bees that started at different rooms of the house and got stuck somewhere during their exploration, never covering the entirety of the property.

Figure 6. Rank plots for parameters m, s_squared, and w across four MCMC chains. Each subplot shows the distribution of ranks by chain. There are noticeable deviations from uniformity (e.g., stair-step patterns or imbalances across chains) suggesting potential sampling issues.

KDE and Trace Plots

Similar to R-hat, trace plots help us assess the convergence of MCMC samples by visualizing how the algorithm explores the parameter space over time. PyMC provides two types of trace plots to diagnose mixing issues: Kernel Density Estimate (KDE) plots and iteration-based trace plots. Each of these serves a distinct purpose in evaluating whether the sampler has properly explored the target distribution.

The KDE plot (usually on the left) estimates the posterior density for each chain, where each line represents a separate chain. This allows us to check whether all chains have converged to the same distribution. If the KDEs overlap, it suggests that the chains are sampling from the same posterior and that mixing has occurred. On the other hand, the trace plot (usually on the right) visualizes how parameter values change over MCMC iterations (steps), with each line representing a different chain. A well-mixed sampler will produce trace plots that look noisy and random, with no clear structure or separation between chains.

Using the bee analogy, trace plots can be thought of as snapshots of the “features” of the house at different locations. If the sampler is working correctly, the KDEs in the left plot should align closely, showing that all bees (chains) have explored the house similarly. Meanwhile, the right plot should show highly variable traces that blend together, confirming that the chains are actively moving through the space rather than getting stuck in specific regions.
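In PyMC workflows, both panels come from a single ArviZ call. A sketch, under the same assumptions as before:

import arviz as az
import matplotlib.pyplot as plt

az.plot_trace(idata_1, var_names=["m", "s"])  # KDEs on the left, traces on the right
plt.tight_layout()
plt.show()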

Figure 7. Density and trace plots for parameters m and s from the first model across four MCMC chains. The left panel shows kernel density estimates (KDE) of the marginal posterior distribution for each chain, indicating consistent central tendency and spread. The right panel displays the trace plot over iterations, with overlapping chains and no apparent divergences, suggesting good mixing and convergence.

However, if your sampler has poor mixing or convergence issues, you will see something like the figure below. In this case, the KDEs will not overlap, meaning that different chains have sampled from different distributions rather than a shared posterior. The trace plot will also show structured patterns instead of random noise, indicating that chains are stuck in different regions of the parameter space and failing to fully explore it.

Figure 8. KDE (left) and trace plots (right) for parameters m, s_squared, and w across MCMC chains for the second model. Multimodal distributions are visible for m and w, suggesting potential identifiability issues. Trace plots reveal that chains explore different modes with limited mixing, particularly for m, highlighting challenges in convergence and effective sampling.

By using trace plots alongside the other diagnostics, you can identify sampling issues and determine whether your MCMC algorithm is effectively exploring the posterior distribution.

Pair Plots

A third kind of plot that is often useful for diagnostics is the pair plot. In models where we want to estimate the posterior distribution of multiple parameters, pair plots allow us to observe how different parameters are correlated. To understand how such plots are formed, think again about the bee analogy. If you imagine a plot with the width and length of the house, each “step” that the bees take can be represented by an (x, y) combination. Likewise, each parameter of the posterior is represented as a dimension, and we create scatter plots showing where the sampler walked, using parameter values as coordinates. Here, we are plotting each unique pair (x, y), resulting in the scatter plot you see in the middle of the image below. The one-dimensional plots you see on the edges are the marginal distributions over each parameter, giving us additional information on the sampler’s behavior when exploring them.
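With ArviZ, such a plot can be drawn directly from the sampling output. A sketch, same assumptions as before:

import arviz as az
import matplotlib.pyplot as plt

az.plot_pair(idata_1, var_names=["m", "s"], kind="scatter", marginals=True)
plt.show()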

Take a look at the pair plot from the first model.

Figure 9. Joint posterior distribution of parameters m and s, with marginal densities. The scatter plot shows a roughly symmetric, elliptical shape, suggesting a low correlation between m and s.

Each axis represents one of the two parameters whose posteriors we are estimating. For now, let’s focus on the scatter plot in the middle, which shows the parameter combinations sampled from the posterior. The fact we have a very even distribution means that, for any particular value of m, there was a range of values of s that were equally likely to be sampled. Additionally, we don’t see any correlation between the two parameters, which is usually good! There are cases when we would expect some correlation, such as when our model involves a regression line. However, in this instance, we have no reason to believe two parameters should be highly correlated, so the fact we don’t observe unusual behavior is positive news. 

Now, take a look at the pair plots from the second model.

Figure 10. Pair plot of the joint posterior distributions for parameters m, s_squared, and w. The scatter plots reveal strong correlations between several parameters.

Given that this model has five parameters to be estimated, we naturally have a greater number of plots since we are analyzing them pair-wise. However, they look odd compared to the previous example. Namely, rather than having an even distribution of points, the samples here either seem to be divided across two regions or seem somewhat correlated. This is another way of visualizing what the rank plots have shown: the sampler did not explore the full posterior distribution. Below we isolated the top left plot, which contains the samples from m0 and m1. Unlike the plot from model 1, here we see that the value of one parameter greatly influences the value of the other. If we sampled m1 around 2.5, for example, m0 is likely to be sampled from a very narrow range around 1.5.

Figure 11. Joint posterior distribution of parameters m₀ and m₁, with marginal densities.

Certain shapes appear in problematic pair plots relatively frequently. Diagonal patterns, for example, indicate a high correlation between parameters. Banana shapes are often connected to parametrization issues and tend to show up in models with tight priors or constrained parameters. Funnel shapes might indicate hierarchical models with bad geometry. When we have two separate islands, like in the plot above, this can indicate that the posterior is bimodal AND that the chains haven’t mixed well. Keep in mind, however, that these shapes might indicate problems but do not necessarily do so. It’s up to the data scientist to examine the model and determine which behaviors are expected and which ones are not!

Some Fixing Techniques

When your diagnostics indicate sampling problems — whether concerning R-hat values, low ESS, unusual rank plots, separated trace plots, or strange parameter correlations in pair plots — several strategies can help you address the underlying issues. Sampling problems typically stem from the target posterior being too complex for the sampler to explore efficiently. Complex target distributions might have:

  • Multiple modes (peaks) that the sampler struggles to move between
  • Irregular shapes with narrow “corridors” connecting different regions
  • Areas of drastically different scales (like the “neck” of a funnel)
  • Heavy tails that are difficult to sample accurately

In the bee analogy, these complexities represent houses with unusual floor plans — disconnected rooms, extremely narrow hallways, or areas that change dramatically in size. Just as bees might get trapped in specific regions of such houses, MCMC chains can get stuck in certain areas of the posterior.

Figure 12. Examples of multimodal target distributions.
Figure 13. Examples of weirdly shaped distributions.

To help the sampler in its exploration, there are simple strategies we can use.

Strategy 1: Reparameterization

Reparameterization is particularly effective for hierarchical models and distributions with challenging geometries. It involves transforming your model’s parameters to make them easier to sample. Back to the bee analogy, imagine the bees are exploring a house with a peculiar layout: a spacious living room that connects to the kitchen through a very, very narrow hallway. One aspect we hadn’t mentioned before is that the bees have to fly in the same way through the entire house. That means that if we dictate the bees should use large “steps,” they will explore the living room very well but hit the walls in the hallway head-on. Likewise, if their steps are small, they will explore the narrow hallway well, but take forever to cover the entire living room. The difference in scales, which is natural to the house, makes the bees’ job more difficult.

A classic example that represents this scenario is Neal’s funnel, where the scale of one parameter depends on another:

\[
p(y, x) = \text{Normal}(y|0, 3) \times \prod_{n=1}^{9} \text{Normal}(x_n|0, e^{y/2})
\]

Figure 14. Log marginal density of y and the first dimension of Neal’s funnel. The neck is the region the sampler struggles to sample from, where the step size needs to be much smaller than in the body. (Image source: Stan User’s Guide)

We can see that the scale of x is dependent on the value of y. To fix this problem, we can separate x and y as independent standard Normals and then transform these variables into the desired funnel distribution. Instead of sampling directly like this:

\[
\begin{align*}
y &\sim \text{Normal}(0, 3) \\
x &\sim \text{Normal}(0, e^{y/2})
\end{align*}
\]

You can reparameterize to sample from standard Normals first:

\[
\begin{align*}
y_{\text{raw}} &\sim \text{Normal}(0, 1) \\
x_{\text{raw}} &\sim \text{Normal}(0, 1) \\
y &= 3y_{\text{raw}} \\
x &= e^{y/2} x_{\text{raw}}
\end{align*}
\]

This technique separates the hierarchical parameters and makes sampling more efficient by eliminating the dependency between them. 

Reparameterization is like redesigning the house such that instead of forcing the bees to find a single narrow hallway, we create a new layout where all passages have similar widths. This helps the bees use a consistent flying pattern throughout their exploration.
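For reference, a minimal non-centered version of Neal’s funnel in PyMC could look like the sketch below; the variable names are ours and the sampler settings are illustrative.

import pymc as pm

with pm.Model() as funnel_noncentered:
    y_raw = pm.Normal("y_raw", mu=0, sigma=1)
    x_raw = pm.Normal("x_raw", mu=0, sigma=1, shape=9)

    # Deterministic transforms recover the original funnel variables
    y = pm.Deterministic("y", 3 * y_raw)
    x = pm.Deterministic("x", pm.math.exp(y / 2) * x_raw)

    idata_funnel = pm.sample(draws=2000, chains=4, target_accept=0.9)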

Strategy 2: Handling Heavy-tailed Distributions

Heavy-tailed distributions like Cauchy and Student-T present challenges for samplers and the ideal step size. Their tails require larger step sizes than their central regions (similar to very long hallways that require the bees to travel long distances), which creates a challenge:

  • Small step sizes lead to inefficient sampling in the tails
  • Large step sizes cause too many rejections in the center

Figure 15. Probability density functions for various Cauchy distributions illustrate the effects of changing the location parameter and scale parameter. (Image source: Wikipedia)

Reparameterization solutions include:

  • For Cauchy: Defining the variable as a transformation of a Uniform distribution using the Cauchy inverse CDF
  • For Student-T: Using a Gamma-Mixture representation (see the sketch after this list)
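As an illustration of the second point, a Student-T with ν degrees of freedom can be written as a Gamma mixture of Normals: if the precision follows a Gamma(ν/2, ν/2) prior, the resulting variable is marginally Student-T(ν). A hedged PyMC sketch (ν and all names are illustrative):

import pymc as pm

nu = 3  # degrees of freedom (illustrative)
with pm.Model() as student_t_as_mixture:
    tau = pm.Gamma("tau", alpha=nu / 2, beta=nu / 2)       # mixing precision
    x = pm.Normal("x", mu=0, sigma=1 / pm.math.sqrt(tau))  # marginally Student-T(nu)
    idata_t = pm.sample(draws=2000, chains=4)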

Strategy 3: Hyperparameter Tuning

Sometimes the solution lies in adjusting the sampler’s hyperparameters, as in the sketch after this list:

  • Increase total iterations: The simplest approach — give the sampler more time to explore.
  • Increase target acceptance rate (adapt_delta): Reduce divergent transitions (try 0.9 instead of the default 0.8 for complex models, for example).
  • Increase max_treedepth: Allow the sampler to take more steps per iteration.
  • Extend warmup/adaptation phase: Give the sampler more time to adapt to the posterior geometry.
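In PyMC, these knobs correspond to arguments of pm.sample, with target_accept playing the role of Stan’s adapt_delta. A sketch with illustrative values:

import pymc as pm

with model_2:  # the mixture model assumed from the earlier sketch
    idata_2_tuned = pm.sample(
        draws=4000,         # more iterations
        tune=2000,          # longer warmup/adaptation phase
        chains=4,
        target_accept=0.9,  # higher acceptance target (adapt_delta)
        random_seed=42,
    )
# max_treedepth can be raised through an explicit step method,
# e.g. step=pm.NUTS(target_accept=0.9, max_treedepth=12)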

Remember that while these adjustments may improve your diagnostic metrics, they often treat symptoms rather than underlying causes. The previous strategies (reparameterization and better proposal distributions) typically offer more fundamental solutions.

Strategy 4: Better Proposal Distributions

This strategy applies to samplers that rely on an explicit proposal distribution, such as Metropolis-Hastings, rather than to gradient-based methods. It basically asks the question: “I’m currently here in this landscape. Where should I jump to next so that I explore the full landscape, or how do I know that the next jump is the jump I should make?” Thus, choosing a good distribution means making sure that the sampling process explores the full parameter space instead of just a specific region. A good proposal distribution should:

  1. Have substantial probability mass where the target distribution does.
  2. Allow the sampler to make jumps of the appropriate size.

One common choice of the proposal distribution is the Gaussian (Normal) distribution with mean μ and standard deviation σ — the scale of the distribution that we can tune to decide how far to jump from the current position to the next position. If we choose the scale for the proposal distribution to be too small, it might either take too long to explore the entire posterior or it will get stuck in a region and never explore the full distribution. But if the scale is too large, you might never get to explore some regions, jumping over them. It’s like playing ping-pong where we only reach the two edges but not the middle.
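The effect of the proposal scale is easy to see with the toy Metropolis sampler from the Markov Chain Monte Carlo section above (a sketch; the step sizes are illustrative):

import numpy as np

for step_size in (0.05, 1.0, 50.0):
    draws = metropolis(n_samples=5000, step_size=step_size)
    # With a tiny scale the chain crawls; with a huge scale most proposals
    # are rejected and the chain keeps repeating the same value
    accept_rate = np.mean(np.diff(draws) != 0)
    print(f"step_size={step_size}: acceptance rate ~ {accept_rate:.2f}")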

Improve Prior Specification

When all else fails, reconsider your model’s prior specifications. Vague or weakly informative priors (like uniformly distributed priors) can sometimes lead to sampling difficulties. More informative priors, when justified by domain knowledge, can help guide the sampler toward more reasonable regions of the parameter space. Sometimes, despite your best efforts, a model may remain challenging to sample effectively. In such cases, consider whether a simpler model might achieve similar inferential goals while being more computationally tractable. The best model is often not the most complex one, but the one that balances complexity with reliability. The table below shows the summary of fixing strategies for different issues.

| Diagnostic Signal | Potential Issue | Recommended Fix |
| --- | --- | --- |
| High R-hat | Poor mixing between chains | Increase iterations, adjust the step size |
| Low ESS | High autocorrelation | Reparameterization, increase adapt_delta |
| Non-uniform rank plots | Chains stuck in different regions | Better proposal distribution, start with multiple chains |
| Separated KDEs in trace plots | Chains exploring different distributions | Reparameterization |
| Funnel shapes in pair plots | Hierarchical model issues | Non-centered reparameterization |
| Disjoint clusters in pair plots | Multimodality with poor mixing | Adjusted distribution, simulated annealing |

Conclusion

Assessing the quality of MCMC sampling is crucial for ensuring reliable inference. In this article, we explored key diagnostic metrics such as R-hat, ESS, rank plots, trace plots, and pair plots, discussing how each helps determine whether the sampler is performing properly.

If there’s one takeaway we want you to keep in mind, it’s that you should always run diagnostics before drawing conclusions from your samples. No single metric provides a definitive answer — each serves as a tool that highlights potential issues rather than proving convergence. When problems arise, strategies such as reparameterization, hyperparameter tuning, and prior specification can help improve sampling efficiency.

By combining these diagnostics with thoughtful modeling decisions, you can ensure a more robust analysis, reducing the risk of misleading inferences due to poor sampling behavior.

References

B. Gilbert, Bob’s bees: the importance of using multiple bees (chains) to judge MCMC convergence (2018), Youtube

Chi-Feng, MCMC demo (n.d.), GitHub

D. Simpson, Maybe it’s time to let the old ways die; or We broke R-hat so now we have to fix it. (2019), Statistical Modeling, Causal Inference, and Social Science

M. Taboga, Markov Chain Monte Carlo (MCMC) methods (2021), Lectures on Probability Theory and Mathematical Statistics, Kindle Direct Publishing.

T. Wiecki, MCMC Sampling for Dummies (2024), twiecki.io

Stan User’s Guide, Reparameterization (n.d.), Stan Documentation

Ivory Tower Notes: The Problem

When a data science problem is “the” problem

Did you ever spend months on a Machine Learning project, only to discover you never defined the “correct” problem at the start? If so, or even if not, and you are only starting with the data science or AI field, welcome to my first Ivory Tower Note, where I will address this topic. 


The term “Ivory Tower” is a metaphor for a situation in which someone is isolated from the practical realities of everyday life. In academia, the term often refers to researchers who engage deeply in theoretical pursuits and remain distant from the realities that practitioners face outside academia.

As a former researcher, I wrote a short series of posts from my old Ivory Tower notes — the notes before the LLM era.

Scary, I know. I am writing this to manage expectations and the question, “Why ever did you do things this way?” — “Because no LLM told me how to do otherwise 10+ years ago.”

That’s why my notes contain “legacy” topics such as data mining, machine learning, multi-criteria decision-making, and (sometimes) human interactions, airplanes ✈ and art.

Nonetheless, whenever there is an opportunity, I will map my “old” knowledge to generative AI advances and explain how I applied it to datasets beyond the Ivory Tower.

Welcome to post #1…


How every Machine Learning and AI journey starts

 — It starts with a problem. 

For you, this is usually “the” problem because you need to live with it for months or, in the case of research, years.

With “the” problem, I am addressing the business problem you don’t fully understand or know how to solve at first. 

An even worse scenario is when you think you fully understand and know how to solve it quickly. This then creates only more problems that are again only yours to solve. But more about this in the upcoming sections. 

So, what’s “the” problem about?

Causa: It’s mostly about not managing or leveraging resources properly —  workforce, equipment, money, or time. 

Ratio: It’s usually about generating business value, which can range from improved accuracy and increased productivity to cost savings, revenue gains, and faster reaction, decision, planning, delivery, or turnaround times. 

Veritas: It’s always about finding a solution that relies on, and is hidden somewhere in, the existing dataset. 

Or, more than one dataset that someone labelled as “the one”, and that’s waiting for you to solve the problem. Because datasets are created from technical or business process logs, “there has to be a solution lying somewhere within them.”

Ah, if only it were so easy.

Avoiding a different chain of thought again, the point is you will need to:

1 — Understand the problem fully,
2 — If not given, find the dataset “behind” it, and 
3 — Create a methodology to get to the solution that will generate business value from it. 

On this path, you will be tracked and measured, and time will not be on your side to deliver the solution that will solve “the universe equation.” 

That’s why you will need to approach the problem methodologically, drill down to smaller problems first, and focus entirely on them because they are the root cause of the overall problem. 

That’s why it’s good to learn how to…

Think like a Data Scientist.

Returning to the problem itself, let’s imagine that you are a tourist lost somewhere in the big museum, and you want to figure out where you are. What you do next is walk to the closest info map on the floor, which will show your current location. 

At this moment, in front of you, you see something like this: 

Data Science Process. Image by Author, inspired by Microsoft Learn

The next thing you might tell yourself is, “I want to get to Frida Kahlo’s painting.” (Note: These are the insights you want to get.)

Because your goal is to see this one painting that brought you miles away from your home and now sits two floors below, you head straight to the second floor. Beforehand, you memorized the shortest path to reach your goal. (Note: This is the initial data collection and discovery phase.)

However, along the way, you stumble upon some obstacles — the elevator is shut down for renovation, so you have to use the stairs. The museum paintings were reordered just two days ago, and the info plans didn’t reflect the changes, so the path you had in mind to get to the painting is not accurate. 

Then you find yourself wandering around the third floor already, asking quietly again, “How do I get out of this labyrinth and get to my painting faster?”

While you don’t know the answer, you ask the museum staff on the third floor to help you out, and you start collecting the new data to get the correct route to your painting. (Note: This is a new data collection and discovery phase.)

Nonetheless, once you get to the second floor, you get lost again, but what you do next is start noticing a pattern in how the paintings have been ordered chronologically and thematically to group the artists whose styles overlap, thus giving you an indication of where to go to find your painting. (Note: This is a modelling phase overlapped with the enrichment phase from the dataset you collected during school days — your art knowledge.)

Finally, after adapting the pattern analysis and recalling the collected inputs on the museum route, you arrive in front of the painting you had been planning to see since booking your flight a few months ago. 

What I described now is how you approach data science and, nowadays, generative AI problems. You always start with the end goal in mind and ask yourself:

“What is the expected outcome I want or need to get from this?”

Then you start planning from this question backwards. The example above started with requesting holidays, booking flights, arranging accommodation, traveling to a destination, buying museum tickets, wandering around in a museum, and then seeing the painting you’ve been reading about for ages. 

Of course, there is more to it, and this process should be approached differently if you need to solve someone else’s problem, which is a bit more complex than locating the painting in the museum. 

In this case, you have to…

Ask the “good” questions.

To do this, let’s define what a good question means [1]: 

A good data science question must be concrete, tractable, and answerable. Your question works well if it naturally points to a feasible approach for your project. If your question is too vague to suggest what data you need, it won’t effectively guide your work.

Formulating good questions keeps you on track so you don’t get lost in the data that should be used to get to the specific problem solution, or you don’t end up solving the wrong problem.

Going into more detail, good questions will help identify gaps in reasoning, avoid faulty premises, and create alternative scenarios in case things do go south (which almost always happens)👇🏼.

Image created by Author after analyzing “Chapter 2. Setting goals by asking good questions” from “Think Like a Data Scientist” book [2]

From the above-presented diagram, you understand how good questions, first and foremost, need to support concrete assumptions. This means they need to be formulated in a way that your premises are clear and ensure they can be tested without mixing up facts with opinions.

Good questions produce answers that move you closer to your goal, whether through confirming hypotheses, providing new insights, or eliminating wrong paths. They are measurable, and with this, they connect to project goals because they are formulated with consideration of what’s possible, valuable, and efficient [2].

Good questions are answerable with available data, considering current data relevance and limitations. 

Last but not least, good questions anticipate obstacles. If something is certain in data science, this is the uncertainty, so having backup plans when things don’t work as expected is important to produce results for your project.

Let’s exemplify this with one use case of an airline company that has a challenge with increasing its fleet availability due to unplanned technical groundings (UTG).

These unexpected maintenance events disrupt flights and cost the company significant money. Because of this, executives decided to react to the problem and call in a data scientist (you) to help them improve aircraft availability.

Now, if this would be the first data science task you ever got, you would maybe start an investigation by asking:

“How can we eliminate all unplanned maintenance events?”

You understand how this question is an example of the wrong or “poor” one because:

  • It is not realistic: It includes every possible defect, both small and big, into one impossible goal of “zero operational interruptions”.
  • It doesn’t hold a measure of success: There’s no concrete metric to show progress, and if you’re not at zero, you’re at “failure.”
  • It is not data-driven: The question didn’t cover which data is recorded before delays occur, and how the aircraft unavailability is measured and reported from it.

So, instead of this vague question, you would probably ask a set of targeted questions:

  1. Which aircraft (sub)system is most critical to flight disruptions?
    (Concrete, specific, answerable) This question narrows down your scope, focusing on only one or two specific (sub) systems affecting most delays.
  2. What constitutes “critical downtime” from an operational perspective?
    (Valuable, ties to business goals) If the airline (or regulatory body) doesn’t define how many minutes of unscheduled downtime matter for schedule disruptions, you might waste effort solving less urgent issues.
  3. Which data sources capture the root causes, and how can we fuse them?
    (Manageable, narrows the scope of the project further) This clarifies which data sources one would need to find the problem solution.

With these sharper questions, you will drill down to the real problem:

  • Not all delays weigh the same in cost or impact. The “correct” data science problem is to predict critical subsystem failures that lead to operationally costly interruptions so maintenance crews can prioritize them.

That’s why…

Defining the problem determines every step after. 

It’s the foundation upon which your data, modelling, and evaluation phases are built 👇🏼.

Image created by Author after analyzing and overlapping different images from “Chapter 2. Setting goals by asking good questions, Think Like a Data Scientist” book [2]

It means you are clarifying the project’s objectives, constraints, and scope; you need to articulate the ultimate goal first and, except for asking “What’s the expected outcome I want or need to get from this?”, ask as well: 

What would success look like and how can we measure it?

From there, drill down to (possible) next-level questions that you (I) have learned from the Ivory Tower days:
 — History questions: “Has anyone tried to solve this before? What happened? What is still missing?”
 —  Context questions: “Who is affected by this problem and how? How are they partially resolving it now? Which sources, methods, and tools are they using now, and can they still be reused in the new models?”
 — Impact Questions: “What happens if we don’t solve this? What changes if we do? Is there a value we can create by default? How much will this approach cost?”
 — Assumption Questions: “What are we taking for granted that might not be true (especially when it comes to data and stakeholders’ ideas)?”
 — ….

Then, do this in the loop and always “ask, ask again, and don’t stop asking” questions so you can drill down and understand which data and analysis are needed and what the ground problem is. 

This is the evergreen knowledge you can apply nowadays, too, when deciding if your problem is of a predictive or generative nature.

(More about this in some other note where I will explain how problematic it is trying to solve the problem with the models that have never seen — or have never been trained on — similar problems before.)

Now, going back to memory lane…

I want to add one important note: I have learned from late nights in the Ivory Tower that no amount of data or data science knowledge can save you if you’re solving the wrong problem and trying to get the solution (answer) from a question that was simply wrong and vague. 

When you have a problem on hand, do not rush into assumptions or building the models without understanding what you need to do (Festina lente).

In addition, prepare yourself for unexpected situations and do a proper investigation with your stakeholders and domain experts because their patience will be limited, too. 

With this, I want to say that the “real art” of being successful in data projects is knowing precisely what the problem is, figuring out if it can be solved in the first place, and then coming up with the “how” part. 

You get there by learning to ask good questions.

To end this narrative, recall how Einstein famously said:  

If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.


Thank you for reading, and stay tuned for the next Ivory Tower note.

If you found this post valuable, feel free to share it with your network. 👏

Connect for more stories on Medium ✍ and LinkedIn 🖇.


References: 

[1] DS4Humans, Backwards Design, accessed: April 5th 2025, https://ds4humans.com/40_in_practice/05_backwards_design.html#defining-a-good-question

[2] Godsey, B. (2017), Think Like a Data Scientist: Tackle the data science process step-by-step, Manning Publications.

Mining Rules from Data

Using decision trees for quick segmentation

Working with products, we might face a need to introduce some “rules”. Let me explain what I mean by “rules” in practical examples: 

  • Imagine that we’re seeing a massive wave of fraud in our product, and we want to restrict onboarding for a particular segment of customers to lower this risk. For example, we found out that the majority of fraudsters had specific user agents and IP addresses from certain countries. 
  • Another option is to send coupons to customers to use in our online shop. However, we would like to treat only customers who are likely to churn since loyal users will return to the product anyway. We might figure out that the most feasible group is customers who joined less than a year ago and decreased their spending by 30%+ last month. 
  • Transactional businesses often have a segment of customers where they are losing money. For example, a bank customer passed the verification and regularly reached out to customer support (so generated onboarding and servicing costs) while doing almost no transactions (so not generating any revenue). The bank might introduce a small monthly subscription fee for customers with less than $1,000 in their account since they are likely non-profitable.

Of course, in all these cases, we might have used a complex Machine Learning model that would take into account all the factors and predict the probability (either of a customer being a fraudster or churning). Still, under some circumstances, we might prefer just a set of static rules for the following reasons:  

  • The speed and complexity of implementation. Deploying an ML model in production takes time and effort. If you are experiencing a fraud wave right now, it might be more feasible to go live with a set of static rules that can be implemented quickly and then work on a comprehensive solution. 
  • Interpretability. ML models are black boxes. Even though we might be able to understand at a high level how they work and what features are the most important ones, it’s challenging to explain them to customers. In the example of subscription fees for non-profitable customers, it’s important to share a set of transparent rules with customers so that they can understand the pricing. 
  • Compliance. Some industries, like finance or healthcare, might require auditable and rule-based decisions to meet compliance requirements.

In this article, I want to show you how we can solve business problems using such rules. We will take a practical example and go really deep into this topic:

  • we will discuss which models we can use to mine such rules from data,
  • we will build a Decision Tree Classifier from scratch to learn how it works,
  • we will fit the sklearn Decision Tree Classifier model to extract the rules from the data,
  • we will learn how to parse the Decision Tree structure to get the resulting segments,
  • finally, we will explore different options for category encoding, since the sklearn implementation doesn’t support categorical variables.

We have lots of topics to cover, so let’s jump into it.

Case

As usual, it’s easier to learn something with a practical example. So, let’s start by discussing the task we will be solving in this article. 

We will work with the Bank Marketing dataset (CC BY 4.0 license). This dataset contains data about the direct marketing campaigns of a Portuguese banking institution. For each customer, we know a bunch of features and whether they subscribed to a term deposit (our target). 

Our business goal is to maximise the number of conversions (subscriptions) with limited operational resources. So, we can’t call the whole user base, and we want to reach the best outcome with the resources we have.

The first step is to look at the data. So, let’s load the data set.

import pandas as pd
pd.set_option('display.max_colwidth', 5000)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('bank-full.csv', sep = ';')
df = df.drop(['duration', 'campaign'], axis = 1)
# removed columns related to the current marketing campaign, 
# since they introduce data leakage

df.head()

We know quite a lot about the customers, including personal data (such as job type or marital status) and their previous behaviour (such as whether they have a loan or their average yearly balance).

Image by author

The next step is to select a machine-learning model. There are two classes of models that are usually used when we need something easily interpretable:

  • decision trees,
  • linear or logistic regression.

Both options are feasible and can give us good models that can be easily implemented and interpreted. However, in this article, I would like to stick to the decision tree model because it produces actual rules, while logistic regression will give us probability as a weighted sum of features.

Data Preprocessing 

As we’ve seen in the data, there are lots of categorical variables (such as education or marital status). Unfortunately, the sklearn decision tree implementation can’t handle categorical data, so we need to do some preprocessing.

Let’s start by transforming yes/no flags into integers. 

for p in ['default', 'housing', 'loan', 'y']:
    df[p] = df[p].map(lambda x: 1 if x == 'yes' else 0)

The next step is to transform the month variable. We can use one-hot encoding for months, introducing flags like month_jan , month_feb , etc. However, there might be seasonal effects, and I think it would be more reasonable to convert months into integers following their order. 

month_map = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
# I saved 5 mins by asking ChatGPT to do this mapping

df['month'] = df.month.map(lambda x: month_map[x] if x in month_map else x)

For all other categorical variables, let’s use one-hot encoding. We will discuss different strategies for category encoding later, but for now, let’s stick to the default approach.

The easiest way to do one-hot encoding is to leverage get_dummies function in pandas.

fin_df = pd.get_dummies(
  df, columns=['job', 'marital', 'education', 'poutcome', 'contact'], 
  dtype = int, # to convert to flags 0/1
  drop_first = False # to keep all possible values
)

This function transforms each categorical variable into a separate 1/0 column for each possible value. We can see how it works for the poutcome column. 

# df and fin_df share the same index, so we can attach the original
# poutcome column back to see how it maps to the one-hot flags
fin_df.assign(poutcome = df['poutcome'])\
    .groupby(['poutcome', 'poutcome_unknown', 'poutcome_failure', 
      'poutcome_other', 'poutcome_success'], as_index = False).y.count()\
    .rename(columns = {'y': 'cases'})\
    .sort_values('cases', ascending = False)
Image by author

Our data is now ready, and it’s time to discuss how decision tree classifiers work.

Decision Tree Classifier: Theory

In this section, we’ll explore the theory behind the Decision Tree Classifier and build the algorithm from scratch. If you’re more interested in a practical example, feel free to skip ahead to the next part.

The easiest way to understand the decision tree model is to look at an example. So, let’s build a simple model based on our data. We will use DecisionTreeClassifier from sklearn.

import sklearn.tree

feature_names = fin_df.drop(['y'], axis = 1).columns
model = sklearn.tree.DecisionTreeClassifier(
  max_depth = 2, min_samples_leaf = 1000)
model.fit(fin_df[feature_names], fin_df['y'])

The next step is to visualise the tree.

import graphviz

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True, 
    proportion = True, precision = 2 
    # to show shares of classes instead of absolute numbers
)

graph = graphviz.Source(dot_data)
graph
Image by author

So, we can see that the model is straightforward. It’s a set of binary splits that we can use as heuristics. 

Let’s figure out how the classifier works under the hood. As usual, the best way to understand the model is to build the logic from scratch. 

The cornerstone of any problem is the optimisation function. By default, in the decision tree classifier, we’re optimising the Gini coefficient. Imagine drawing one random item from the sample and then another. The Gini coefficient equals the probability that these two items come from different classes. So, our goal will be minimising the Gini coefficient. 

In the case of just two classes (like in our example, where marketing intervention was either successful or not), the Gini coefficient is defined just by one parameter p , where p is the probability of getting an item from one of the classes. Here’s the formula:

\[\textbf{gini}(\textsf{p}) = 1 – \textsf{p}^2 – (1 – \textsf{p})^2 = 2 * \textsf{p} * (1 – \textsf{p}) \]

If our classification is ideal and we are able to separate the classes perfectly, then the Gini coefficient will be equal to 0. The worst-case scenario is when p = 0.5 , then the Gini coefficient is also equal to 0.5.

With the formula above, we can calculate the Gini coefficient for each leaf of the tree. To calculate the Gini coefficient for the whole tree, we need to combine the Gini coefficients of binary splits. For that, we can just get a weighted sum:

\[\textbf{gini}_{\textsf{total}} = \textbf{gini}_{\textsf{left}} * \frac{\textbf{n}_{\textsf{left}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}} + \textbf{gini}_{\textsf{right}} * \frac{\textbf{n}_{\textsf{right}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}}\]

Now that we know what value we’re optimising, we only need to define all possible binary splits, iterate through them and choose the best option. 

Defining all possible binary splits is also quite straightforward. We can do it one by one for each parameter, sort possible values, and pick up thresholds between them. For example, for months (integer from 1 to 12). 

Image by author

Let’s try to code it and see whether we will come to the same result. First, we will define functions that calculate the Gini coefficient for one dataset and the combination.

def get_gini(df):
    # Gini impurity for a binary target: 2 * p * (1 - p)
    p = df.y.mean()
    return 2*p*(1-p)

print(get_gini(fin_df)) 
# 0.2065
# close to what we see at the root node of Decision Tree

def get_gini_comb(df1, df2):
    # weighted average of the Gini coefficients of two subsets
    n1 = df1.shape[0]
    n2 = df2.shape[0]

    gini1 = get_gini(df1)
    gini2 = get_gini(df2)
    return (gini1*n1 + gini2*n2)/(n1 + n2)
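
As a quick sanity check, we can call it on the split that will turn out to be the best one below (the expected value comes from the results reported later in this section):

print(get_gini_comb(
    fin_df[fin_df.poutcome_success <= 0.5],
    fin_df[fin_df.poutcome_success > 0.5]
))
# ~0.1872, the weighted Gini of the optimal poutcome_success split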

The next step is to get all possible thresholds for one parameter and calculate their Gini coefficients. 

import tqdm
def optimise_one_parameter(df, param):
    tmp = []
    possible_values = list(sorted(df[param].unique()))
    print(param)

    for i in tqdm.tqdm(range(1, len(possible_values))): 
        threshold = (possible_values[i-1] + possible_values[i])/2
        gini = get_gini_comb(df[df[param] <= threshold], 
          df[df[param] > threshold])
        tmp.append(
            {'param': param, 
            'threshold': threshold, 
            'gini': gini, 
            'sizes': (df[df[param] <= threshold].shape[0], df[df[param] > threshold].shape[0])
            }
        )
    return pd.DataFrame(tmp)

The final step is to iterate through all features and calculate all possible splits. 

tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(fin_df, feature))
opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)
Image by author

Wonderful, we’ve got the same result as in our DecisionTreeClassifier model. The optimal split is whether poutcome = success or not. We’ve reduced the Gini coefficient from 0.2065 to 0.1872. 

To continue building the tree, we need to repeat the process recursively. For example, let’s go down the poutcome_success <= 0.5 branch:

tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(
      fin_df[fin_df.poutcome_success <= 0.5], feature))

opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)
Image by author

The only question we still need to discuss is the stopping criteria. In our initial example, we’ve used two conditions:

  • max_depth = 2 — it just limits the maximum depth of the tree. 
  • min_samples_leaf = 1000 prevents us from getting leaf nodes with fewer than 1K samples. Because of this condition, the model chose a binary split on contact_unknown even though age led to a lower Gini coefficient.

Also, I usually limit min_impurity_decrease, which prevents us from going further if the gains are too small. By gains, we mean the reduction of the Gini coefficient.

So, we’ve understood how the Decision Tree Classifier works, and now it’s time to use it in practice.

If you’re interested in seeing how the Decision Tree Regressor works in full detail, you can look it up in my previous article.

Decision Trees: Practice

We’ve already built a simple tree model with two layers, but it’s definitely not enough since it’s too simple to get all the insights from the data. Let’s train another Decision Tree, this time limiting only the minimum number of samples in leaves and the minimum impurity decrease (i.e., the required reduction of the Gini coefficient). 

model = sklearn.tree.DecisionTreeClassifier(
  min_samples_leaf = 1000, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True, 
    proportion = True, precision=2, impurity = True)

graph = graphviz.Source(dot_data)

# saving graph to png file
png_bytes = graph.pipe(format='png')
with open('decision_tree.png','wb') as f:
    f.write(png_bytes)
Image by author

That’s it. We’ve got our rules to split customers into groups (leaves). Now, we can iterate through groups and see which groups of customers we want to contact. Even though our model is relatively small, it’s daunting to copy all conditions from the image. Luckily, we can parse the tree structure and get all the groups from the model.

The Decision Tree classifier has a tree_ attribute that gives us access to low-level attributes of the tree, such as node_count.

n_nodes = model.tree_.node_count
print(n_nodes)
# 13

The tree_ variable also stores the entire tree structure as parallel arrays, where the i-th element of each array stores the information about node i. For the root, i equals 0.

Here are the arrays we have to represent the tree structure: 

  • children_left and children_right — IDs of the left and right child nodes, respectively; -1 if the node is a leaf.
  • feature — feature used to split node i.
  • threshold — threshold value used for the binary split of node i.
  • n_node_samples — number of training samples that reached node i.
  • values — shares of samples from each class.

Let’s save all these arrays. 

children_left = model.tree_.children_left
# [ 1,  2,  3,  4,  5,  6, -1, -1, -1, -1, -1, -1, -1]
children_right = model.tree_.children_right
# [12, 11, 10,  9,  8,  7, -1, -1, -1, -1, -1, -1, -1]
features = model.tree_.feature
# [30, 34,  0,  3,  6,  6, -2, -2, -2, -2, -2, -2, -2]
thresholds = model.tree_.threshold
# [ 0.5,  0.5, 59.5,  0.5,  6.5,  2.5, -2. , -2. , -2. , -2. , -2. , -2. , -2. ]
num_nodes = model.tree_.n_node_samples
# [45211, 43700, 30692, 29328, 14165,  4165,  2053,  2112, 10000, 
#  15163,  1364, 13008,  1511] 
values = model.tree_.value
# [[[0.8830152 , 0.1169848 ]],
# [[0.90135011, 0.09864989]],
# [[0.87671054, 0.12328946]],
# [[0.88550191, 0.11449809]],
# [[0.8530886 , 0.1469114 ]],
# [[0.76686675, 0.23313325]],
# [[0.87043351, 0.12956649]],
# [[0.66619318, 0.33380682]],
# [[0.889     , 0.111     ]],
# [[0.91578184, 0.08421816]],
# [[0.68768328, 0.31231672]],
# [[0.95948647, 0.04051353]],
# [[0.35274653, 0.64725347]]]

It will be more convenient for us to work with a hierarchical view of the tree structure, so let’s iterate through all nodes and, for each node, save the parent node ID and whether it was a right or left branch. 

hierarchy = {}

for node_id in range(n_nodes):
  if children_left[node_id] != -1: 
    hierarchy[children_left[node_id]] = {
      'parent': node_id, 
      'condition': 'left'
    }
  
  if children_right[node_id] != -1:
    hierarchy[children_right[node_id]] = {
      'parent': node_id, 
      'condition': 'right'
    }

print(hierarchy)
# {1: {'parent': 0, 'condition': 'left'},
# 12: {'parent': 0, 'condition': 'right'},
# 2: {'parent': 1, 'condition': 'left'},
# 11: {'parent': 1, 'condition': 'right'},
# 3: {'parent': 2, 'condition': 'left'},
# 10: {'parent': 2, 'condition': 'right'},
# 4: {'parent': 3, 'condition': 'left'},
# 9: {'parent': 3, 'condition': 'right'},
# 5: {'parent': 4, 'condition': 'left'},
# 8: {'parent': 4, 'condition': 'right'},
# 6: {'parent': 5, 'condition': 'left'},
# 7: {'parent': 5, 'condition': 'right'}}

The next step is to select the leaf nodes, since they are terminal and the most interesting for us, as they define the customer segments. 

leaves = []
for node_id in range(n_nodes):
    if (children_left[node_id] == -1) and (children_right[node_id] == -1):
        leaves.append(node_id)
print(leaves)
# [6, 7, 8, 9, 10, 11, 12]
leaves_df = pd.DataFrame({'node_id': leaves})

The next step is to determine all the conditions applied to each group since they will define our customer segments. The first function get_condition will give us the tuple of feature, condition type and threshold for a node. 

def get_condition(node_id, condition, features, thresholds, feature_names):
    # print(node_id, condition)
    feature = feature_names[features[node_id]]
    threshold = thresholds[node_id]
    cond = '>' if condition == 'right'  else '<='
    return (feature, cond, threshold)

print(get_condition(0, 'left', features, thresholds, feature_names)) 
# ('poutcome_success', '<=', 0.5)

print(get_condition(0, 'right', features, thresholds, feature_names))
# ('poutcome_success', '>', 0.5)

The next function will allow us to recursively go from the leaf node to the root and get all the binary splits. 

def get_decision_path_rec(node_id, decision_path, hierarchy):
  if node_id == 0:
    yield decision_path 
  else:
    parent_id = hierarchy[node_id]['parent']
    condition = hierarchy[node_id]['condition']
    for res in get_decision_path_rec(parent_id, decision_path + [(parent_id, condition)], hierarchy):
        yield res

decision_path = list(get_decision_path_rec(12, [], hierarchy))[0]
print(decision_path) 
# [(0, 'right')]

fmt_decision_path = list(map(
  lambda x: get_condition(x[0], x[1], features, thresholds, feature_names), 
  decision_path))
print(fmt_decision_path)
# [('poutcome_success', '>', 0.5)]

Let’s save the logic of executing the recursion and formatting into a wrapper function.

def get_decision_path(node_id, features, thresholds, hierarchy, feature_names):
  decision_path = list(get_decision_path_rec(node_id, [], hierarchy))[0]
  return list(map(lambda x: get_condition(x[0], x[1], features, thresholds, 
    feature_names), decision_path))

We’ve learned how to get each node’s binary split conditions. The only remaining logic is to combine the conditions. 

def get_decision_path_string(node_id, features, thresholds, hierarchy, 
  feature_names):
  conditions_df = pd.DataFrame(get_decision_path(node_id, features, thresholds, hierarchy, feature_names))
  conditions_df.columns = ['feature', 'condition', 'threshold']

  left_conditions_df = conditions_df[conditions_df.condition == '<=']
  right_conditions_df = conditions_df[conditions_df.condition == '>']

  # deduplication 
  left_conditions_df = left_conditions_df.groupby(['feature', 'condition'], as_index = False).min()
  right_conditions_df = right_conditions_df.groupby(['feature', 'condition'], as_index = False).max()
  
  # concatenation
  fin_conditions_df = pd.concat([left_conditions_df, right_conditions_df])\
      .sort_values(['feature', 'condition'], ascending = False)
  
  # formatting 
  fin_conditions_df['cond_string'] = list(map(
      lambda x, y, z: '(%s %s %.2f)' % (x, y, z),
      fin_conditions_df.feature,
      fin_conditions_df.condition,
      fin_conditions_df.threshold
  ))
  return ' and '.join(fin_conditions_df.cond_string.values)

print(get_decision_path_string(12, features, thresholds, hierarchy, 
  feature_names))
# (poutcome_success > 0.50)

Now, we can calculate the conditions for each group. 

leaves_df['condition'] = leaves_df['node_id'].map(
  lambda x: get_decision_path_string(x, features, thresholds, hierarchy, 
  feature_names)
)

The last step is to add the size and conversion rate of each group.

leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total)\
  .map(lambda x: int(round(x/100)))
leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()

Now, we can use these rules to make decisions. We can sort groups by conversion (probability of successful contact) and pick the customers with the highest probability. 

leaves_df.sort_values('conversion', ascending = False)\
  .drop('node_id', axis = 1).set_index('condition')
Image by author

Imagine we have the resources to contact only around 10% of our user base; in that case, we can focus on the first three groups. Even with such limited capacity, we would expect to get almost 40% conversion — it’s a really good result, and we’ve achieved it with just a bunch of straightforward heuristics.  

In real life, it’s also worth testing the model (or heuristics) before deploying it in production. I would split the training dataset into training and validation parts (by time, to avoid leakage) and check the heuristics’ performance on the validation set to get a better view of the actual model quality. A sketch of what such a check could look like is shown below.
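
A minimal sketch, assuming a hypothetical contact_date column (the public dataset has no year field, so this is illustrative only):

# hypothetical time-based split: contact_date is not in the public dataset
df_sorted = fin_df.sort_values('contact_date')
split_idx = int(len(df_sorted) * 0.9)
train_part = df_sorted.iloc[:split_idx]
valid_part = df_sorted.iloc[split_idx:]

# re-derive the heuristics on train_part, then check them on valid_part,
# e.g. for the strongest segment found earlier:
segment = valid_part[valid_part.poutcome_success > 0.5]
print(segment.shape[0], segment.y.mean())  # segment size and conversion rate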

Working with high cardinality categories

Another topic that is worth discussing in this context is category encoding, since we have to encode the categorical variables for sklearn implementation. We’ve used a straightforward approach with one-hot encoding, but in some cases, it doesn’t work.

Imagine we also have a region column in the data. I’ve synthetically generated English cities for each row. We have 155 unique regions, so the number of features has increased to 190. 

model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 100, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

So, the basic tree now has lots of conditions based on regions and it’s not convenient to work with them.

Image by author

In such a case, it might not be meaningful to explode the number of features, and it’s time to think about encoding. There’s a comprehensive article, “Categorically: Don’t explode — encode!”, that shares a bunch of different options to handle high cardinality categorical variables. I think the most feasible ones in our case will be the following two options:

  • Count or Frequency Encoder, which shows good performance in benchmarks. This encoding assumes that categories of similar size have similar characteristics. 
  • Target Encoder, where we encode the category by the mean value of the target variable. It allows us to prioritise segments with higher conversion and deprioritise segments with lower conversion. Ideally, it would be nice to use historical data to get the averages for the encoding, but we will use the existing dataset. 

It will be interesting to test both approaches, so let’s split our dataset into train and test parts, saving 10% for validation. For simplicity, I’ve used one-hot encoding for all columns except region (since it has the highest cardinality).

from sklearn.model_selection import train_test_split
fin_df = pd.get_dummies(df, columns=['job', 'marital', 'education', 
  'poutcome', 'contact'], dtype = int, drop_first = False)
train_df, test_df = train_test_split(fin_df, test_size=0.1, random_state=42)
print(train_df.shape[0], test_df.shape[0])
# (40689, 4522)

For convenience, let’s combine all the logic for parsing the tree into one function.

def get_model_definition(model, feature_names):
  n_nodes = model.tree_.node_count
  children_left = model.tree_.children_left
  children_right = model.tree_.children_right
  features = model.tree_.feature
  thresholds = model.tree_.threshold
  num_nodes = model.tree_.n_node_samples
  values = model.tree_.value

  hierarchy = {}

  for node_id in range(n_nodes):
      if children_left[node_id] != -1: 
          hierarchy[children_left[node_id]] = {
            'parent': node_id, 
            'condition': 'left'
          }
    
      if children_right[node_id] != -1:
            hierarchy[children_right[node_id]] = {
             'parent': node_id, 
             'condition': 'right'
            }

  leaves = []
  for node_id in range(n_nodes):
      if (children_left[node_id] == -1) and (children_right[node_id] == -1):
          leaves.append(node_id)
  leaves_df = pd.DataFrame({'node_id': leaves})
  leaves_df['condition'] = leaves_df['node_id'].map(
    lambda x: get_decision_path_string(x, features, thresholds, hierarchy, feature_names)
  )

  leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
  leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
  leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total).map(lambda x: int(round(x/100)))
  leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
  leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()
  leaves_df = leaves_df.sort_values('conversion', ascending = False)\
    .drop('node_id', axis = 1).set_index('condition')
  leaves_df['cum_share_of_total'] = leaves_df['share_of_total'].cumsum()
  leaves_df['cum_share_of_converted'] = leaves_df['share_of_converted'].cumsum()
  return leaves_df

Let’s create an encodings data frame, calculating frequencies and conversions. 

region_encoding_df = train_df.groupby('region', as_index = False)\
  .aggregate({'id': 'count', 'y': 'mean'}).rename(columns = 
    {'id': 'region_count', 'y': 'region_target'})

Then, merge it into our training and validation sets. For the validation set, we will also fill NAs with the averages.

train_df = train_df.merge(region_encoding_df, on = 'region')

test_df = test_df.merge(region_encoding_df, on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
  .fillna(region_encoding_df.region_target.mean())
test_df['region_count'] = test_df['region_count']\
  .fillna(region_encoding_df.region_count.mean())

Now, we can fit the models and get their structures.

count_feature_names = train_df.drop(
  ['y', 'id', 'region_target', 'region'], axis = 1).columns
target_feature_names = train_df.drop(
  ['y', 'id', 'region_count', 'region'], axis = 1).columns
print(len(count_feature_names), len(target_feature_names))
# (36, 36)

count_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
count_model.fit(train_df[count_feature_names], train_df['y'])

target_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
target_model.fit(train_df[target_feature_names], train_df['y'])

count_model_def_df = get_model_definition(count_model, count_feature_names)
target_model_def_df = get_model_definition(target_model, target_feature_names)

Let’s look at the structures and select the top groups covering up to 10–15% of our target audience. We can also apply these conditions to our validation set to test our approach in practice. 

Let’s start with Count Encoder. 

Image by author
count_selected_df = test_df[
    (test_df.poutcome_success > 0.50) | 
    ((test_df.poutcome_success <= 0.50) & (test_df.age > 60.50)) | 
    ((test_df.region_count > 3645.50) & (test_df.region_count <= 8151.50) & 
         (test_df.poutcome_success <= 0.50) & (test_df.contact_cellular > 0.50) & (test_df.age <= 60.50))
]

print(count_selected_df.shape[0], count_selected_df.y.sum())
# (508, 227)

We can also see what regions have been selected, and it’s only Manchester.

Image by author

Let’s continue with the Target encoding. 

Image by author
target_selected_df = test_df[
    ((test_df.region_target > 0.21) & (test_df.poutcome_success > 0.50)) | 
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month <= 6.50) & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month > 8.50) & (test_df.housing <= 0.50) 
         & (test_df.contact_unknown <= 0.50)) |
    ((test_df.region_target <= 0.21) & (test_df.poutcome_success > 0.50)) |
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month > 6.50) & (test_df.month <= 8.50) 
         & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50))
]

print(target_selected_df.shape[0], target_selected_df.y.sum())
# (502, 248)

We see a slightly lower number of selected users for communication but a significantly higher number of conversions: 248 vs. 227 (+9.3%).

Let’s also look at the selected categories. We see that the model picked up all the cities with high conversions (Manchester, Liverpool, Bristol, Leicester, and New Castle), but there are also many small regions with high conversions solely due to chance.

region_encoding_df[region_encoding_df.region_target > 0.21]\
  .sort_values('region_count', ascending = False)
Image by author

In our case, it doesn’t have much impact since the share of such small cities is low. However, if you have many more small categories, you might see significant overfitting. Target Encoding can be tricky at this point, so it’s worth keeping an eye on the output of your model. 

Luckily, there’s an approach that can help you overcome this issue. Following the article “Encoding Categorical Variables: A Deep Dive into Target Encoding”, we can add smoothing. The idea is to combine the group’s conversion rate with the overall average: the larger the group, the more weight its data carries, while smaller segments will lean more towards the global average.

First, I’ve selected the parameters that make sense for our distribution, looking at a bunch of options. I chose to use the global average for the groups under 100 people. This part is a bit subjective, so use common sense and your knowledge about the business domain.

import numpy as np
import matplotlib.pyplot as plt

global_mean = train_df.y.mean()

k = 100
f = 10
smooth_df = pd.DataFrame({'region_count':np.arange(1, 100001, 1) })
smooth_df['smoothing'] = (1 / (1 + np.exp(-(smooth_df.region_count - k) / f)))

ax = plt.scatter(smooth_df.region_count, smooth_df.smoothing)
plt.xscale('log')
plt.ylim([-.1, 1.1])
plt.title('Smoothing')
Image by author

Then, we can calculate, based on the selected parameters, the smoothing coefficients and blended averages.

# keep the raw (unsmoothed) conversion rate under a separate name
region_encoding_df['raw_region_target'] = region_encoding_df['region_target']
region_encoding_df['smoothing'] = (1 / (1 + np.exp(-(region_encoding_df.region_count - k) / f)))
region_encoding_df['region_target'] = region_encoding_df.smoothing * region_encoding_df.raw_region_target \
    + (1 - region_encoding_df.smoothing) * global_mean

Then, we can fit another model with smoothed target category encoding.

# drop the previous (unsmoothed) region_target before merging the smoothed one
train_df = train_df.drop(columns = ['region_target'])\
  .merge(region_encoding_df[['region', 'region_target']], on = 'region')
test_df = test_df.drop(columns = ['region_target'])\
  .merge(region_encoding_df[['region', 'region_target']], on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
  .fillna(region_encoding_df.region_target.mean())

target_v2_feature_names = train_df.drop(['y', 'id', 'region'], axis = 1)\
  .columns

target_v2_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
target_v2_model.fit(train_df[target_v2_feature_names], train_df['y'])
target_v2_model_def_df = get_model_definition(target_v2_model, 
  target_v2_feature_names)
Image by author
target_v2_selected_df = test_df[
    ((test_df.region_target > 0.12) & (test_df.poutcome_success > 0.50)) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month <= 6.50) & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month > 8.50) & (test_df.housing <= 0.50) 
         & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target <= 0.12) & (test_df.poutcome_success > 0.50) ) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month > 6.50) & (test_df.month <= 8.50) 
         & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50) )
]

target_v2_selected_df.shape[0], target_v2_selected_df.y.sum()
# (500, 247)

We can see that we’ve eliminated the small cities and prevented overfitting in our model while keeping roughly the same performance, capturing 247 conversions.

region_encoding_df[region_encoding_df.region_target > 0.12]
Image by author

You can also use TargetEncoder from sklearn, which smooths and mixes the category and global means depending on the segment size. However, its fit_transform applies internal cross-fitting, so the training-set encodings are not a single deterministic value per category, which is not ideal for our case of hand-written heuristics.
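
For reference, a minimal sketch of how that might look, assuming scikit-learn >= 1.3 (where TargetEncoder lives in sklearn.preprocessing) and the column names from our example:

from sklearn.preprocessing import TargetEncoder

encoder = TargetEncoder(smooth = 'auto', random_state = 42)
# fit_transform applies internal cross-fitting on the training data
train_region_encoded = encoder.fit_transform(train_df[['region']], train_df['y'])
# transform uses the encoding learned on the full training data
test_region_encoded = encoder.transform(test_df[['region']])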

You can find the full code on GitHub.

Summary

In this article, we explored how to extract simple “rules” from data and use them to inform business decisions. We generated heuristics using a Decision Tree Classifier and touched on the important topic of categorical encoding, since the sklearn implementation requires categorical variables to be converted into numbers.

We saw that this rule-based approach can be surprisingly effective, helping you reach business decisions quickly. However, it’s worth noting that this simplistic approach has its drawbacks:

  • We are trading off the model’s power and accuracy for its simplicity and interpretability, so if you’re optimising for accuracy, choose another approach.
  • Even though we’re using a set of static heuristics, your data can still change, and the heuristics might become outdated, so you need to recheck your model from time to time. 

Thank you very much for reading this article. I hope it was insightful. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

Dataset: Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306

The post Mining Rules from Data appeared first on Towards Data Science.

]]>
Avoiding Costly Mistakes with Uncertainty Quantification for Algorithmic Home Valuations https://towardsdatascience.com/avoiding-costly-mistakes-with-uncertainty-quantification-for-algorithmic-home-valuations/ Tue, 08 Apr 2025 00:40:22 +0000 https://towardsdatascience.com/?p=605679 The simple tricks for using AVMU, or Automated Valuation Model Uncertainty, to make your home buying decisions more confident and less risky!

The post Avoiding Costly Mistakes with Uncertainty Quantification for Algorithmic Home Valuations appeared first on Towards Data Science.

]]>
When you’re about to buy a home, whether you’re an everyday buyer looking for your dream house or a seasoned property investor, there’s a good chance you’ve encountered automated valuation models, or AVMs. These clever tools use massive datasets filled with past property transactions to predict the value of your potential new home. By considering features like location, number of bedrooms, bathrooms, property age, and more, AVMs use AI to learn associations with sales prices. A rapid and low-cost appraisal of any home sounds great on paper, and in many cases it is great. However, with every price prediction comes a level of uncertainty, and failing to consider this uncertainty can be a costly mistake. In this post, I illustrate the application of AI-uncertainty quantification for AVMs through the AVMU methodology.

Price Prediction Uncertainty?

Let’s start off simple. Imagine you’re looking for a two-story, four-bedroom house in a cozy neighborhood in Virginia Beach, VA. You’ve downloaded some local housing data and used it to train your own AVM (you’re tech-savvy like that!).

Case 1: Lucky you, several almost identical homes in the neighborhood have sold for around $500,000 in the past year. Your AVM confidently suggests the home you’re interested in will also likely be worth around the same price. Easy enough, right?

But here’s where it gets trickier:

Case 2: This time, no similar two-story, four-bedroom homes have sold recently. Instead, your dataset shows smaller, one-story homes selling at $400,000, and larger, three-story homes going for $600,000. Your AVM averages things out and again suggests $500,000. It makes sense, your target house is bigger than the cheaper homes and smaller than the pricier ones.

Both scenarios gave you the same $500,000 valuation. However, there’s a catch: The first scenario is backed by solid data (similar homes selling recently), making the price prediction quite reliable. In the second scenario, on the other hand, trusting the price prediction might be a bit riskier. With fewer comparable sales, the AVM had to make “an educated guess”, leading to a less certain price prediction.

The solid AVM in Case 1 is a very helpful decision support tool for purchasing a home, but the shaky AVM in Case 2 can give you a totally wrong idea of the home’s market value. Here’s the big question:

How can you tell whether your AVM prediction is solid or shaky?

AVMU—An Uncertainty Quantification Technique for AVMs

This is exactly why we need AVMU, or Automated Valuation Model Uncertainty. AVMU is a recent methodological framework that helps us quantify exactly how reliable (or uncertain) these AVM predictions are. Think of it as a confidence meter for your house price prediction, helping you make smarter decisions instead of blindly trusting an algorithm.

Let’s return to our Virginia Beach example. You’ve browsed listings extensively and narrowed your choices down to two fantastic homes: let’s call them Home A and Home B.

Image by Author, made partly with DALL-E.

Of course, the first thing you want to know is their market values. Knowing the market value ensures you don’t overpay, potentially saving you from future financial headaches and having to resell the home at a loss. Unfortunately, you don’t have much knowledge about house prices in Virginia Beach, as you’re originally from [insert name of the place you grew up]. Fortunately, you recall the data science skills you picked up in grad school and confidently decide to build your own AVM to get a grasp of the market values of your two candidate homes.

To ensure your AVM predictions are as accurate as possible, you train the model using Mean Squared Error (MSE) as your loss function:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

Here, \( n \) is the number of homes in your training dataset, \( \hat{y}_i \) represents the AVM’s price prediction for home \( i \), and \( y_i \) is the actual price at which home \( i \) was sold.

Image by Author, made partly with DALL-E.

After training the model, you eagerly apply your AVM to Homes A and B. To your surprise (or perhaps excitement?), both homes are valued at exactly $500,000 by the algorithm. Very well, but just as you’re about to place an offer on home B, a thought strikes: these predictions aren’t absolute certainties. They’re “point predictions”, essentially the AVM’s best guess at the most likely market value. In fact, the true market value is probably somewhat higher or lower, and it’s rather unlikely that the AVM prediction nailed the market value down to the exact dollar.

So, how do we measure this uncertainty? This is where AVMU methodology comes into play, with a straightforward but powerful approach:

  1. First, you use cross-validation (e.g., 5-fold CV) to generate out-of-fold price predictions, \( \hat{y}_i \), for all the \( n \) homes in your dataset.
  2. Next, for each home, you calculate how far off the prediction was from the actual sales price. This difference is called the absolute deviation, \( |\hat{y}_i - y_i| \), between the price prediction, \( \hat{y}_i \), and the actual sales price, \( y_i \).
  3. Then, instead of predicting sales prices, you train a separate “uncertainty model”, \( F(\hat{y}_i, x_i) \), using these absolute deviations, \( |\hat{y}_i - y_i| \), as the target. This special model learns patterns indicating when the AVM predictions are typically accurate or uncertain.
  4. Finally, you apply this uncertainty model to estimate how uncertain the price predictions are for Homes A and B (i.e., your test set), by predicting their absolute price deviations. You now have simple uncertainty estimates for both of the homes.

Now, I know exactly what some of you might be thinking about the third step:

“Wait a second, you can’t just put a regression on top of another regression to explain why the first one is off!”

And you’d be absolutely right. Well, sort of. If there were clear, predictable data patterns showing that certain homes were consistently overpriced or underpriced by your AVM, that would mean your AVM wasn’t very good in the first place. Ideally, a good AVM should capture all meaningful patterns in the data. But here’s the clever twist: instead of predicting if a home is specifically overpriced or underpriced (what we call the signed deviation), we focus on absolute deviations. By doing this, we sidestep the issue of explaining if a home is valued too high or too low. Instead, we let the uncertainty model focus on identifying which types of homes the AVM tends to predict accurately and which ones it struggles with, no matter the direction of the error.

From a homebuyer’s perspective, you’re naturally more worried about overpaying. Imagine buying a home for $500,000 only to discover it’s actually worth just $400,000! But in practice, underestimating the value of a home is also more problematic than you’d think. Make an offer that’s too low, and you might just lose your dream home to another buyer. That’s why, as a savvy buyer equipped with AVM predictions, your goal isn’t just to chase the highest or lowest price prediction. Instead, your priority should be robust, reliable valuations that closely match the true market value. And thanks to the AVMU uncertainty estimates, you can now more confidently pinpoint exactly which predictions to trust.

Mathematically, the process described above can be written like this:

\[|\hat{y}_i - y_i| = F(\hat{y}_i, x_i) + \varepsilon_i \quad \text{for } 1 \leq i \leq n\]

and:

\[\text{AVMU}_i = F(\hat{y}_i, x_i)\]

The uncertainty model, \( F(\hat{y}_i, x_i) \), can be based on any regression algorithm (even the same one as your AVM). The difference is, for your uncertainty model you’re not necessarily interested in achieving perfect predictions for the absolute deviations. Instead, you’re interested in ranking the homes based on prediction uncertainty, and thereby learning which of Home A’s and Home B’s price predictions you can trust the most. The MSE loss function used for the AVM (see the first equation) might therefore not be the ideal choice.

Rather than using MSE, you therefore fit your uncertainty model, \( F(\hat{y}_i, x_i) \), to optimize a loss function more suited for ranking. An example of such a loss function is to maximize rank correlation (i.e., Spearman’s \( \rho \)), given by:

\[\rho = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}\]

Here, a higher \( \rho \) means your model ranks homes better regarding prediction uncertainty. \( D_i \) represents the difference in ranks between the actual absolute deviation, \( |\hat{y}_i - y_i| \), and the predicted uncertainty, \( \text{AVMU}_i = F(\hat{y}_i, x_i) \), for home \( i \).
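
To make the workflow concrete, here is a minimal sketch of the four AVMU steps using scikit-learn. The feature DataFrame X, the sale prices y, and the candidate homes X_new are assumed to be prepared elsewhere, and the random forest is just an arbitrary choice of regressor. Note that off-the-shelf regressors minimize squared error rather than directly optimizing Spearman’s \( \rho \), so the rank correlation below is only evaluated, not used as a training loss.

import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

avm = RandomForestRegressor(n_estimators=500, random_state=42)

# Step 1: out-of-fold price predictions from 5-fold cross-validation
y_hat = cross_val_predict(avm, X, y, cv=5)

# Step 2: absolute deviations between predictions and actual sale prices
abs_dev = np.abs(y_hat - y)

# Step 3: fit the uncertainty model on the absolute deviations,
#         using the AVM prediction itself as an additional feature
X_unc = X.copy()
X_unc['y_hat'] = y_hat
uncertainty_model = RandomForestRegressor(n_estimators=500, random_state=42)
uncertainty_model.fit(X_unc, abs_dev)

# how well does the model rank the deviations? (in-sample Spearman's rho)
rho, _ = spearmanr(uncertainty_model.predict(X_unc), abs_dev)

# Step 4: AVMU estimates for the candidate homes (e.g., Homes A and B)
avm.fit(X, y)
X_new_unc = X_new.copy()
X_new_unc['y_hat'] = avm.predict(X_new)
avmu_estimates = uncertainty_model.predict(X_new_unc)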

Image by Author, made partly with DALL-E.

So now you have, for both candidate homes, an AVM price prediction and a corresponding AVMU uncertainty estimate. By combining these two measures, you quickly notice something interesting: even if multiple homes share the same “most likely market value”, the reliability of those predictions can vary greatly. In your case, you see that Home B comes with a significantly higher AVMU uncertainty estimate, signaling that its actual market value could stray far from the $500,000 valuation.

To protect yourself from unnecessary risk, you wisely opt to purchase Home A, whose AVM valuation of $500,000 is backed by stronger certainty. With confidence restored thanks to AVMU, you happily finalize your purchase, knowing you’ve made a smart, data-informed choice, and celebrate your new home with a relaxing drink in your new front yard.

Image by Author, made partly with DALL-E.

Ethics and Other Applications of AVMU

This simple introduction to AVM price uncertainty and how AVMU can guide you when buying a home is just one of its many potential applications. Homes aren’t the only assets that could benefit from quick, low-cost valuation tools. While AVMs are commonly associated with housing due to plentiful data and easily identifiable characteristics, these models, and their uncertainty quantification via AVMU, can apply to virtually anything with a market price. Think about used cars, collectibles, or even pro soccer players. As long as there’s uncertainty in predicting their prices, AVMU can be used to understand it.

Sticking with housing, purchasing decisions aren’t the only area where AVMU could be used. Mortgage lenders frequently use AVMs to estimate the collateral value of properties, yet often overlook how uneven the accuracy of these price predictions can be. Similarly, tax authorities can use AVMs to determine your property taxes but may accidentally set unfair valuations due to unacknowledged uncertainty. Recognizing uncertainty through AVMU can help make these valuations fairer and more accurate across the board.

However, despite its versatility, it’s essential to remember that AVMU is not perfect. It’s still a statistical model relying on data quality and quantity. No model can completely eliminate uncertainty, especially the random aspects inherent in most markets, sometimes referred to as aleatoric or irreducible uncertainty. Imagine a newlywed couple falling head-over-heels for a particular kitchen, prompting them to bid way above the typical market value. Or perhaps bad weather negatively influencing someone’s perception of a house during a viewing. Such unpredictable scenarios will always exist, and AVMU can’t account for every outlier.

Remember, AVMU gives you probabilities, not fixed truths. A home with a higher AVMU uncertainty is more likely to experience price deviations, but that is not guaranteed. And if you find yourself thinking, “should I make a third model to predict the uncertainty of my uncertainty model?”, it’s probably time to accept that some uncertainty is simply unavoidable. So, armed with your AVMU-informed insights, relax, embrace the uncertainty, and enjoy your new home!

References

  • A. J. Pollestad, A. B. Næss and A. Oust, Towards a Better Uncertainty Quantification in Automated Valuation Models (2024), The Journal of Real Estate Finance and Economics.
  • A. J. Pollestad and A. Oust, Harnessing uncertainty: a new approach to real estate investment decision support (2025), Quantitative Finance.

The post Avoiding Costly Mistakes with Uncertainty Quantification for Algorithmic Home Valuations appeared first on Towards Data Science.

]]>
Linear Programming: Managing Multiple Targets with Goal Programming https://towardsdatascience.com/linear-programming-managing-multiple-targets-with-goal-programming/ Thu, 03 Apr 2025 19:12:16 +0000 https://towardsdatascience.com/?p=605406 Part 6: Balancing multiple objectives using the weights and preemptive goal programming approaches

The post Linear Programming: Managing Multiple Targets with Goal Programming appeared first on Towards Data Science.

]]>

This is the sixth (and likely last) part of a Linear Programming series I’ve been writing. With the core concepts covered by the prior articles, this article focuses on goal programming which is a less frequent linear programming (LP) use case. Goal programming is a specific linear programming setup that can handle the optimization of multiple objectives in a single LP problem.

By the end of this article, you’ll understand:
1. Definition of goal programming and when it should be used
2. The weighted goal programming approach — illustrated with an example
3. The preemptive goal programming approach — illustrated with an example

Definition and Use Case of Goal Programming

Goal programming is a special case of linear programming that allows for multiple — often conflicting — objectives to be balanced. With regular LP problems, you pick a single metric to optimize (minimize or maximize) and set constraints to ensure that the optimal solution is feasible. Goal programming is a technique that allows for multiple objective metrics to be targeted at the same time.

The ‘mindset’ of goal programming is fundamentally different from plain vanilla LP problems. Basic LP searches for ways to get as little or as much of a single metric as possible — e.g., maximize profit or minimize waste  — subject to constraints. Often conflicting priorities will be found in the objective function or the constraints. For example, maximize profit (objective) subject to a maximum amount of waste (constraint). With goal programming, we can move important constraint metrics into the objective function so we can optimize on them rather than just constraining them. We can maximize profit and minimize waste at the same time!

Now is a good time to establish the example we will explore for the rest of the article:

Let’s imagine that we manage a factory that makes clothing. Our factory can make pants, shirts, and dresses. Each article of clothing has costs, profit, and waste associated with their production. We want to create a production plan that will make a certain level of profit and also waste below a certain quantity due to an environmental commitment. Let’s say that we want to make $150k a month in profit and we also want to waste less than 20k yards of fabric. In addition to our goals, we can’t spend more than $50k in materials and labor. 

The example above sounds a lot like a regular linear programming problem, right? Well, the twist is that we can’t make $150k in profit and waste less than 20k yards at the same time. In other words, there would be no feasible solution if we were to plug this into a regular linear programming problem. Typically, the goals set in these problems cannot all be achieved with a single solution, otherwise there wouldn’t be a point in using goal programming. We would just use regular old linear programming with our goals as constraints. The real value of goal programming is that it can create a compromise between conflicting goals when regular linear programming would yield an infeasible solution.

The real value of goal programming is that it can create a compromise between conflicting goals when regular linear programming would yield an infeasible solution.

How does goal programming balance and compromise with conflicting goals? There are two popular approaches: (1) weighted and (2) preemptive. We’ll cover these in detail in the following sections.

The weights approach

Here, we’ll dive into the details of the weights approach. The weights method has a single objective function and runs a single Optimization based on (you guessed it) weights! The fact that only one optimization is run under the weights method may seem like a given — but the preemptive method actually runs multiple linear programming optimizations. We’ll get to that in the next section…

The weights method has specific targets or goals for multiple metrics — e.g., make at least $150k in profit selling clothes or waste no more than 20k yards of fabric. For regular LP, we want to fully optimize. For the weights method of goal programming, we want to get as close to hitting goals as possible — after we reach a goal, the optimization doesn’t see more benefit in further maximization or minimization, so it prioritizes hitting the next most important goal. If this seems confusing now, don’t worry it will make more sense as we get into the example.

The objective function for the weights approach is specially formulated to minimize the weighted differences between a metric’s goal and the actual value of the metric. Let’s jump to our example from above — i.e., we want to make $150k in profit and waste less than 20k yards of raw material. Our objective is to minimize how far off we are from both of these goals. 

Here is the mathematical formulation for this objective:
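
In compact form, with \( w_1 \) and \( w_2 \) denoting the goal weights and \( P \) and \( W \) the goal deviations (a reconstruction consistent with the caption below):

\[\min \; w_1 P + w_2 W\]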

Where w1 and w2 are assigned weights and P and W are how far we miss our profit and waste goals respectively

With our objective function set up, we need to define our constraints. We will have two types of constraints (1) goal related constraints and (2) regular linear programming constraints (the same kind of constraints you would find in plain vanilla LP). Let’s talk about the goal related constraints first.

We will need to create two things to set up our goal related constraints, (1) profit and waste functions and (2) multiple slack variables. Let’s go through these one at a time.

The profit and waste functions are very straightforward. They combine our decision variables to calculate total profit and total waste given a specific solution. Below are the formulas that tie profit and waste to the number of pants, shirts and dresses we produce:
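
Reconstructing them from the per-unit coefficients used in the Python code later in this article ($10/$3/$15 of profit and 1.5/1/3 yards of waste per pair of pants, shirt, and dress, respectively):

\[\text{profit} = 10 \cdot \text{pants} + 3 \cdot \text{shirts} + 15 \cdot \text{dresses}\]
\[\text{waste} = 1.5 \cdot \text{pants} + 1 \cdot \text{shirts} + 3 \cdot \text{dresses}\]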

profit and waste functions

With our profit and waste functions established, let’s start talking about our slack variables. In goal programming, slack variables are used to measure how far a solution is from hitting a goal. In our example, the variables P and W are both slack variables — they represent how much lower our profit is compared to our profit goal and how much higher our waste is compared to our waste goal. Slack variables are embedded in constraints. Below are the constraint functions for our profit and waste goals — again, the P’s and the W’s are our slack variables:
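
In equation form, consistent with the PuLP constraints defined in the code below:

\[\text{profit} + P^{-} - P^{+} = 150{,}000\]
\[\text{waste} + W^{-} - W^{+} = 20{,}000\]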

P+, P-, W+ and W- are slack variables, profit and waste are functions established with the formulas above

Note that we have plus and minus slack variables — this allows us to miss the goal on either end. We only want to penalize the slack variable that goes in the opposite direction of our goal (e.g., we don’t want to penalize more profit than our goal, we only want to penalize less profit) — that is why only one slack variable per goal is in the objective function. With this new notation, let’s rewrite our objective function:
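
With this notation, and consistent with the code that follows, the objective can be written as:

\[\min \; w_1 P^{-} + w_2 W^{+}\]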

Objective function with updated slack variable notation

We have now done all of the special work for goal programming. The last thing we need to do is quickly add our plain vanilla budget constraint. We are using a regular constraint for our budget because, in our example, it is a hard constraint. Unlike with profit and waste, we cannot violate the budget.

Regular (not goal programming related) budget constraint

Now, we have a fully specified goal programming problem. Let’s set it up in Python and solve!

import pulp

# Define the problem
problem = pulp.LpProblem("Clothing_Company_Goal_Programming", pulp.LpMinimize)

# Decision variables: number of pants, shirts, and dresses produced
pants = pulp.LpVariable('pants', lowBound=0, cat='Continuous')
shirts = pulp.LpVariable('shirts', lowBound=0, cat='Continuous')
dresses = pulp.LpVariable('dresses', lowBound=0, cat='Continuous')

# Slack (deviation) variables for the profit and waste goals
profit_deviation_neg = pulp.LpVariable('profit_deviation_neg', lowBound=0)
profit_deviation_pos = pulp.LpVariable('profit_deviation_pos', lowBound=0)
waste_deviation_neg = pulp.LpVariable('waste_deviation_neg', lowBound=0)
waste_deviation_pos = pulp.LpVariable('waste_deviation_pos', lowBound=0)

# Profit, cost and waste functions (same coefficients as in the preemptive example below)
profit = 10 * pants + 3 * shirts + 15 * dresses
cost = 5 * pants + 1 * shirts + 10 * dresses
waste = 1.5 * pants + 1 * shirts + 3 * dresses

# Objective: minimize the (equally weighted) deviations from both goals
problem += profit_deviation_neg + waste_deviation_pos

# Profit goal: $150,000 in profit
problem += profit + profit_deviation_neg - profit_deviation_pos == 150000, "Profit_Goal"

# Waste goal: No more than 20,000 yards of waste
problem += waste + waste_deviation_neg - waste_deviation_pos == 20000, "Waste_Goal"

# Budget constraint
problem += cost <= 50000

# Solve the problem
problem.solve()

# Display the results
print("Status:", pulp.LpStatus[problem.status])
print("Pants produced:", pulp.value(pants))
print("Shirts produced:", pulp.value(shirts))
print("Dresses produced:", pulp.value(dresses))
print("Cost :", pulp.value(cost))
print("Profit :", pulp.value(profit))
print("Profit deviation (positive):", pulp.value(profit_deviation_pos))
print("Profit deviation (negative):", pulp.value(profit_deviation_neg))
print("Waste :", pulp.value(waste))
print("Waste deviation (positive):", pulp.value(waste_deviation_pos))
print("Waste deviation (negative):", pulp.value(waste_deviation_neg))

This optimization recommends we make 0 pants, 50,000 shirts and 0 dresses. We make $150k in profit, which matches our goal exactly, and we waste 50k yards of fabric, which exceeds our max waste goal by 30k yards. The full results are printed by the code and shown below:

results of optimization run with equal weights

Now that we’ve got the basic structure of the weights approach down, let’s actually talk about the weights! In our first example, we gave equal weights to a dollar of profit and to a yard of waste. This probably doesn’t make a lot of sense because they are different units. Setting the weights is a subjective decision to be made by the practitioner. For our example, we’ll decide that wasting a yard of fabric is 1.5 times as bad as missing a dollar of profit. In other words, we’ll increase the weight of fabric waste to 1.5 in our objective function.

problem += profit_deviation_neg + 1.5*waste_deviation_pos

The optimization with the updated weights recommends we make about 8,572 pants, 7,143 shirts and 0 dresses. With this solution, we generate $107k in profit (a goal miss of $43k) and we waste 20,000 yards of fabric, which matches our goal exactly. The full results are printed by the code and shown below:

Results of optimization run with 1.5 weight on fabric waste

Clearly, shifting the weights of the goals can have large impact on the optimization results. We need to carefully set our weights to adequately balance the relative importance of our goals!

Now that we have a solid understanding of how the weighted approach works, let’s shift to talking about the goal programming with the preemptive approach.

Preemptive Approach

While the weights method balances goals using weights in the objective function, the preemptive method gives hierarchical priority to goals through iterative optimizations. That’s a lot of words, don’t worry, we’ll break it down!

Here are the steps to the preemptive approach:

1. Run a regular linear programming optimization on your first goal — e.g., maximize profit
2. Save the objective value from that run
3. Run another regular linear programming on the next most important goal — e.g., minimize waste — but, add the objective value from the last run as a constraint
4. Repeat the process until you have gone through all goal metrics

Two important features of the preemptive method are (1) it prioritizes goals by rank and (2) the objective value of a higher importance goal cannot be decreased (because of the hard constraint) when optimizing lower priority goals. Let’s go over an example to build intuition.

For our example, let’s say that profit is the most important goal and minimizing waste is the second. We’ll start out by running a plain vanilla optimization that maximizes profit:

# Define the problem
problem = pulp.LpProblem("Clothing_Company_Goal_Programming", pulp.LpMaximize)

# Decision variables: number of pants, shirts, and dresses produced
pants = pulp.LpVariable('pants', lowBound=0, cat='Continuous')
shirts = pulp.LpVariable('shirts', lowBound=0, cat='Continuous')
dresses = pulp.LpVariable('dresses', lowBound=0, cat='Continuous')

# Profit and cost coefficients
profit = 10 * pants + 3 * shirts + 15 * dresses
cost = 5 * pants + 1 * shirts + 10 * dresses
waste = 1.5 * pants + 1 * shirts + 3 * dresses

# Objective function: Maximize profit
problem += profit

# Constraints
# Budget constraint
problem += cost <= 50000

# Solve the problem
problem.solve()

# Display the results
print("Status:", pulp.LpStatus[problem.status])
print("Pants produced:", pulp.value(pants))
print("Shirts produced:", pulp.value(shirts))
print("Dresses produced:", pulp.value(dresses))
print("Cost :", pulp.value(cost))
print("Profit :", pulp.value(profit))

The results of the profit maximizing LP problem are below:

Profit maximization

So, our objective function says to make 50k shirts and collect a profit of $150k. This was only the first optimization we are going to run though! Following the algorithm outlined above, we will now run another LP that minimizes waste but, we will add a constraint of profit ≥ $150k to ensure we don’t get a worse profit.

# Define the problem
problem = pulp.LpProblem("Clothing_Company_Goal_Programming", pulp.LpMinimize)

# Decision variables: number of pants, shirts, and dresses produced
pants = pulp.LpVariable('pants', lowBound=0, cat='Continuous')
shirts = pulp.LpVariable('shirts', lowBound=0, cat='Continuous')
dresses = pulp.LpVariable('dresses', lowBound=0, cat='Continuous')

# Profit and cost coefficients
profit = 10 * pants + 3 * shirts + 15 * dresses
cost = 5 * pants + 1 * shirts + 10 * dresses
waste = 1.5 * pants + 1 * shirts + 3 * dresses

# Objective function: Minimize the fabric waste
problem += waste

# Budget constraint
problem += cost <= 50000

problem += profit >= 150000

# Solve the problem
problem.solve()

# Display the results
print("Status:", pulp.LpStatus[problem.status])
print("Pants produced:", pulp.value(pants))
print("Shirts produced:", pulp.value(shirts))
print("Dresses produced:", pulp.value(dresses))
print("Cost :", pulp.value(cost))
print("Profit :", pulp.value(profit))
print("Waste :", pulp.value(waste))

And here are the results of this final optimization:

results of waste minimizing optimization

The astute observer will notice that the solution is the exact same 😅. This can often be the case with the preemptive method. The constraint of previously optimized goals can be very restrictive. The only time the iterative optimizations differ is when there are multiple ways to achieve the optimal values of the previous goals. For example, if there were two ways to get $150k in profit, one with more waste and the other with less, our second iteration would return the solution with the lower waste. In the preemptive method, there is no trade-off between the goals. Even if there were a solution that made $149k in profit with 0 yards of waste, the preemptive method would always choose $150k in profit with 50k yards of waste. The extra $1,000 of profit is infinitely more important than any amount of wasted fabric.

The preemptive method should be used when goals are clearly prioritized, and there is no substitution between them — meaning no amount of success in lower priority goals can compensate for reduced optimization in higher priority ones. When used correctly, the preemptive method can help optimize a primary goal while trying to find good solutions for lower priority goals very effectively.

Conclusion

Goal programming provides a framework that extends traditional linear programming to optimize multiple metrics at the same time. The weighted approach balances priorities via weights in the objective function and is appropriate when objective metrics have relative importance that can be quantified. The preemptive method is an iterative approach that gives priority to metrics based on hierarchy. It is appropriate when some objectives are wholly more important than others. Both approaches can be powerful optimization techniques when applied in the correct context!

Happy optimizing!

Earlier articles in this series:

The post Linear Programming: Managing Multiple Targets with Goal Programming appeared first on Towards Data Science.

]]>
The Art of Noise https://towardsdatascience.com/the-art-of-noise/ Thu, 03 Apr 2025 01:12:22 +0000 https://towardsdatascience.com/?p=605395 Understanding and implementing a diffusion model from scratch with PyTorch

The post The Art of Noise appeared first on Towards Data Science.

]]>
Introduction

In my last several articles I talked about generative deep learning algorithms, which are mostly related to text generation tasks. So, I think it would be interesting to switch to generative algorithms for image generation now. We know that nowadays there are plenty of deep learning models specialized for generating images, such as the Autoencoder, Variational Autoencoder (VAE), Generative Adversarial Network (GAN) and Neural Style Transfer (NST). I have actually posted some of my writings about these topics on Medium as well, and I provide the links at the end of this article if you want to read them.

In today’s article, I would like to discuss the so-called diffusion model — one of the most impactful models in the field of deep learning for image generation. The idea of this algorithm was first proposed in the paper titled Deep Unsupervised Learning using Nonequilibrium Thermodynamics written by Sohl-Dickstein et al. back in 2015 [1]. Their framework was then developed further by Ho et al. in 2020 in their paper titled Denoising Diffusion Probabilistic Models (DDPM) [2]. DDPM was later adapted by OpenAI and Google to develop DALLE-2 and Imagen, which are known for their impressive capabilities to generate high-quality images.

How Diffusion Model Works

Generally speaking, a diffusion model works by generating an image from noise. We can think of it like an artist transforming a splash of paint on a canvas into a beautiful artwork. In order to do so, the diffusion model needs to be trained first. There are two main steps in the training process, namely forward diffusion and backward diffusion.

Figure 1. The forward and backward diffusion process [3].

As you can see in the above figure, forward diffusion is a process where Gaussian noise is applied to the original image iteratively. We keep adding noise until the image is completely unrecognizable, at which point we can say that the image now lies in the latent space. Unlike Autoencoders and GANs, where the latent space typically has a lower dimension than the original image, the latent space in DDPM maintains the exact same dimensionality as the original image. This noising process follows the principle of a Markov Chain, meaning that the image at timestep t depends only on the image at timestep t-1. Forward diffusion is considered easy since what we basically do is just add some noise step by step.
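
To make this concrete, the forward (noising) step defined in the DDPM paper [2] is a Gaussian transition, and thanks to the Markov property there is also a closed form that jumps directly from the original image \( x_0 \) to any timestep \( t \):

\[q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big), \qquad q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big)\]

where \( \alpha_t = 1 - \beta_t \), \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \), and \( \beta_t \) controls how much noise is added at timestep \( t \).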

The second training phase is called backward diffusion, where our objective is to remove the noise little by little until we obtain a clear image. This process follows the principle of the reverse Markov Chain, where the image at timestep t-1 can only be obtained based on the image at timestep t. Such a denoising process is really difficult since we need to guess which pixels are noise and which ones belong to the actual image content. Thus, we need to employ a neural network model to do so.
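
In the DDPM formulation [2], this reverse step is also modeled as a Gaussian whose mean is produced by the neural network:

\[p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 \mathbf{I}\big)\]

so, during training, the network essentially learns to predict the noise that was added in the forward process.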

DDPM uses U-Net as the basis of the deep learning architecture for backward diffusion. However, instead of using the original U-Net model [4], we need to make several modifications to it so that it will be more suitable for our task. Later on, I am going to train this model on the MNIST Handwritten Digit dataset [5], and we will see whether it can generate similar images.

Well, that was pretty much all the fundamental concepts you need to know about diffusion models for now. In the next sections we are going to go deeper into the details while implementing the algorithm from scratch.


PyTorch Implementation

We are going to start by importing the required modules. In case you’re not yet familiar with the imports below, both torch and torchvision are the libraries we’ll use for preparing the model and the dataset. Meanwhile, matplotlib and tqdm will help us display images and progress bars.

# Codeblock 1
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm

As the modules have been imported, the next thing to do is to initialize some config parameters. Look at the Codeblock 2 below for the details.

# Codeblock 2
IMAGE_SIZE     = 28     #(1)
NUM_CHANNELS   = 1      #(2)

BATCH_SIZE     = 2
NUM_EPOCHS     = 10
LEARNING_RATE  = 0.001

NUM_TIMESTEPS  = 1000   #(3)
BETA_START     = 0.0001 #(4)
BETA_END       = 0.02   #(5)
TIME_EMBED_DIM = 32     #(6)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")  #(7)
DEVICE

# Codeblock 2 Output
device(type='cuda')

At the lines marked with #(1) and #(2) I set IMAGE_SIZE and NUM_CHANNELS to 28 and 1, which match the image dimensions in the MNIST dataset. The BATCH_SIZE, NUM_EPOCHS, and LEARNING_RATE variables are pretty straightforward, so I don’t think I need to explain them further.

At line #(3), the variable NUM_TIMESTEPS denotes the number of iterations in the forward and backward diffusion processes. Timestep 0 is the condition where the image is in its original state (the leftmost image in Figure 1). In this case, since we set this parameter to 1000, timestep 999 is the condition where the image is completely unrecognizable (the rightmost image in Figure 1). It is important to keep in mind that the choice of the number of timesteps involves a tradeoff between model accuracy and computational cost. If we assign a small value to NUM_TIMESTEPS, the inference time will be shorter, yet the resulting image might not be very good since the model has fewer steps to refine the image in the backward diffusion stage. On the other hand, increasing NUM_TIMESTEPS slows down the inference process, but we can expect the output image to have better quality thanks to the gradual denoising, which results in a more precise reconstruction.

Next, the BETA_START (#(4)) and BETA_END (#(5)) variables are used to control the amount of Gaussian noise added at each timestep, whereas TIME_EMBED_DIM (#(6)) determines the length of the feature vector used to store the timestep information. Lastly, at line #(7) I assign “cuda” to the DEVICE variable if PyTorch detects a GPU on our machine. I highly recommend running this project on a GPU since training a diffusion model is computationally expensive. Note that the values set for NUM_TIMESTEPS, BETA_START, and BETA_END are all adopted directly from the DDPM paper [2].

The complete implementation will be done in several steps: constructing the U-Net model, preparing the dataset, defining the noise scheduler for the diffusion process, training, and inference. We are going to discuss each of these stages in the following sub-sections.


The U-Net Architecture: Time Embedding

As I’ve mentioned earlier, the basis of a diffusion model is U-Net. This architecture is used because its output layer is suitable for representing an image, which makes sense given that it was originally introduced for image segmentation tasks. The following figure shows what the original U-Net architecture looks like.

Figure 2. The original U-Net model proposed in [4].

However, it is necessary to modify this architecture so that it can also take the timestep information into account. In addition, since we will only use the MNIST dataset, we also need to make the model smaller. Just remember the convention in deep learning that simpler models are often more effective for simple tasks.

In the figure below I show you the entire modified U-Net model. Here you can see that the time embedding tensor is injected into the model at every stage, which will later be done by element-wise summation, allowing the model to capture the timestep information. Next, instead of repeating each of the downsampling and upsampling stages four times like the original U-Net, in this case we only repeat each of them twice. Additionally, it is worth noting that the stack of downsampling stages is also known as the encoder, whereas the stack of upsampling stages is often called the decoder.

Figure 3. The modified U-Net model for our diffusion task [3].

Now let’s start constructing the architecture by creating a class for generating the time embedding tensor. The idea is similar to the positional embedding in the Transformer. See Codeblock 3 below for the details.

# Codeblock 3
class TimeEmbedding(nn.Module):
    def forward(self):
        time = torch.arange(NUM_TIMESTEPS, device=DEVICE).reshape(NUM_TIMESTEPS, 1)
        print(f"time\t\t: {time.shape}")
          
        i = torch.arange(0, TIME_EMBED_DIM, 2, device=DEVICE)
        denominator = torch.pow(10000, i/TIME_EMBED_DIM)
        print(f"denominator\t: {denominator.shape}")
          
        even_time_embed = torch.sin(time/denominator)  #(1)
        odd_time_embed  = torch.cos(time/denominator)  #(2)
        print(f"even_time_embed\t: {even_time_embed.shape}")
        print(f"odd_time_embed\t: {odd_time_embed.shape}")
          
        stacked = torch.stack([even_time_embed, odd_time_embed], dim=2)  #(3)
        print(f"stacked\t\t: {stacked.shape}")
        time_embed = torch.flatten(stacked, start_dim=1, end_dim=2)  #(4)
        print(f"time_embed\t: {time_embed.shape}")
          
        return time_embed

What we basically do in the above code is to create a tensor of size NUM_TIMESTEPS × TIME_EMBED_DIM (1000×32), where every single row of this tensor will contain the timestep information. Later on, each of the 1000 timesteps will be represented by a feature vector of length 32. The values in the tensor themselves are obtained based on the two equations in Figure 4. In the Codeblock 3 above, these two equations are implemented at line #(1) and #(2), each forming a tensor having the size of 1000×16. Next, these tensors are combined using the code at line #(3) and #(4).

Here I also print out every single step done in the above codeblock so that you can get a better understanding of what is actually happening inside the TimeEmbedding class. If you want more explanation about the above code, feel free to read my previous post about the Transformer, which you can access through the link at the end of this article. Once you click the link, just scroll all the way down to the Positional Encoding section.

Figure 4. The sinusoidal positional encoding formula from the Transformer paper [6].
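In case you prefer the formula in text form, the sinusoidal encoding implemented in Codeblock 3 can be written as follows, where t is the timestep, i indexes the sine/cosine pairs, and d is TIME_EMBED_DIM:

\[ PE_{(t,\,2i)} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad PE_{(t,\,2i+1)} = \cos\!\left(\frac{t}{10000^{2i/d}}\right) \]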

Now let’s check if the TimeEmbedding class works properly using the following testing code. The resulting output shows that it successfully produced a tensor of size 1000×32, which is exactly what we expected earlier.

# Codeblock 4
time_embed_test = TimeEmbedding()
out_test = time_embed_test()

# Codeblock 4 Output
time            : torch.Size([1000, 1])
denominator     : torch.Size([16])
even_time_embed : torch.Size([1000, 16])
odd_time_embed  : torch.Size([1000, 16])
stacked         : torch.Size([1000, 16, 2])
time_embed      : torch.Size([1000, 32])

The U-Net Architecture: DoubleConv

If you take a closer look at the modified architecture, you will see that we actually got lots of repeating patterns, such as the ones highlighted in yellow boxes in the following figure.

Figure 5. The processes done inside the yellow boxes will be implemented in the DoubleConv class [3].

These five yellow boxes share the same structure: two convolution layers, with the time embedding tensor injected right after the first convolution operation is performed. So, what we are going to do now is create another class named DoubleConv to reproduce this structure. Look at Codeblocks 5a and 5b below to see how I do that.

# Codeblock 5a
class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()
        
        self.conv_0 = nn.Conv2d(in_channels=in_channels,  #(2)
                                out_channels=out_channels, 
                                kernel_size=3, 
                                bias=False, 
                                padding=1)
        self.bn_0 = nn.BatchNorm2d(num_features=out_channels)  #(3)
        
        self.time_embedding = TimeEmbedding()  #(4)
        self.linear = nn.Linear(in_features=TIME_EMBED_DIM,  #(5)
                                out_features=out_channels)
        
        self.conv_1 = nn.Conv2d(in_channels=out_channels,  #(6)
                                out_channels=out_channels, 
                                kernel_size=3, 
                                bias=False, 
                                padding=1)
        self.bn_1 = nn.BatchNorm2d(num_features=out_channels)  #(7)
        
        self.relu = nn.ReLU(inplace=True)  #(8)

The two inputs of the __init__() method above give us the flexibility to configure the number of input and output channels (#(1)), so that the DoubleConv class can be used to instantiate all five yellow boxes simply by adjusting its arguments. As the name suggests, here we initialize two convolution layers (lines #(2) and #(6)), each followed by a batch normalization layer and a ReLU activation function. Keep in mind that the two normalization layers need to be initialized separately (lines #(3) and #(7)) since each of them has its own trainable normalization parameters. Meanwhile, the ReLU activation function only needs to be initialized once (#(8)) because it contains no parameters, allowing it to be reused in different parts of the network. At line #(4), we initialize the TimeEmbedding layer we created earlier, which will later be connected to a standard linear layer (#(5)). This linear layer is responsible for adjusting the dimension of the time embedding tensor so that the resulting output can be summed element-wise with the output of the first convolution layer.

Now let’s take a look at Codeblock 5b below to better understand the flow of the DoubleConv block. Here you can see that the forward() method accepts two inputs: the raw image x and the timestep information t, as shown at line #(1). We initially process the image with the first Conv-BN-ReLU sequence (#(2–4)). This Conv-BN-ReLU structure is typically used when working with CNN-based models, even if the illustration does not explicitly show the batch normalization and ReLU layers. Apart from the image, we then take the row of the embedding tensor corresponding to the t-th timestep of each image (#(5)) and pass it through the linear layer (#(6)). We still need to expand the dimensions of the resulting tensor using the code at line #(7) before performing the element-wise summation at line #(8). Finally, we process the resulting tensor with the second Conv-BN-ReLU sequence (#(9–11)).

# Codeblock 5b
    def forward(self, x, t):  #(1)
        print(f'images\t\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        
        x = self.conv_0(x)  #(2)
        x = self.bn_0(x)    #(3)
        x = self.relu(x)    #(4)
        print(f'\nafter first conv\t: {x.size()}')
        
        time_embed = self.time_embedding()[t]      #(5)
        print(f'\ntime_embed\t\t: {time_embed.size()}')
        
        time_embed = self.linear(time_embed)       #(6)
        print(f'time_embed after linear\t: {time_embed.size()}')
        
        time_embed = time_embed[:, :, None, None]  #(7)
        print(f'time_embed expanded\t: {time_embed.size()}')
        
        x = x + time_embed  #(8)
        print(f'\nafter summation\t\t: {x.size()}')
        
        x = self.conv_1(x)  #(9)
        x = self.bn_1(x)    #(10)
        x = self.relu(x)    #(11)
        print(f'after second conv\t: {x.size()}')
        
        return x

To see if our DoubleConv implementation works properly, we are going to test it with Codeblock 6 below. Here I want to simulate the very first instance of this block, which corresponds to the leftmost yellow box in Figure 5. To do so, we need to set the in_channels and out_channels parameters to 1 and 64, respectively (#(1)). Next, we initialize two input tensors, namely x_test and t_test. The x_test tensor has the size of 2×1×28×28, representing a batch of two grayscale images of size 28×28 (#(2)). Keep in mind that this is just a dummy tensor of random values which will be replaced with actual images from the MNIST dataset later in the training phase. Meanwhile, t_test is a tensor containing the timestep numbers of the corresponding images (#(3)). The values for this tensor are randomly selected between 0 and NUM_TIMESTEPS (1000). Note that the datatype of this tensor must be an integer since the numbers will be used for indexing, as shown at line #(5) back in Codeblock 5b. Lastly, at line #(4) we pass both the x_test and t_test tensors to the double_conv_test layer.

By the way, I re-ran the previous codeblocks with the print() functions removed before running the following code so that the outputs look neater.

# Codeblock 6
double_conv_test = DoubleConv(in_channels=1, out_channels=64).to(DEVICE)  #(1)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)  #(2)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)  #(3)

out_test = double_conv_test(x_test, t_test)  #(4)

# Codeblock 6 Output
images                  : torch.Size([2, 1, 28, 28])   #(1)
timesteps               : torch.Size([2]), tensor([468, 304], device='cuda:0')  #(2)

after first conv        : torch.Size([2, 64, 28, 28])  #(3)

time_embed              : torch.Size([2, 32])          #(4)
time_embed after linear : torch.Size([2, 64])
time_embed expanded     : torch.Size([2, 64, 1, 1])    #(5)

after summation         : torch.Size([2, 64, 28, 28])  #(6)
after second conv       : torch.Size([2, 64, 28, 28])  #(7)

The shape of our original input tensors can be seen at lines #(1) and #(2) in the above output. Specifically at line #(2), I also print out the two randomly selected timesteps. In this example we assume that the two images in the x tensor have already been noised to the noise levels of the 468th and 304th timesteps prior to being fed into the network. We can see that the shape of the image tensor x changes to 2×64×28×28 after being passed through the first convolution layer (#(3)). Meanwhile, the size of our time embedding tensor becomes 2×32 (#(4)), which is obtained by extracting rows 468 and 304 from the original embedding of size 1000×32. In order to allow the element-wise summation at line #(6) to be performed, we need to map the 32-dimensional time embedding vectors into 64 dimensions and expand their axes, resulting in a tensor of size 2×64×1×1 (#(5)) so that it can be broadcast to the 2×64×28×28 tensor. After the summation is done, we then pass the tensor through the second convolution layer, at which point the tensor dimensions do not change at all (#(7)).
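If the broadcasting step is hard to picture, the short standalone snippet below (independent of the model classes, using dummy tensors only) reproduces what happens at lines #(7) and #(8) of Codeblock 5b:

import torch

# Standalone illustration with dummy tensors (not part of the model itself).
features = torch.randn(2, 64, 28, 28)  # output of the first convolution
time_vec = torch.randn(2, 64)          # time embedding after the linear layer

time_vec = time_vec[:, :, None, None]  # expand to shape (2, 64, 1, 1)
summed = features + time_vec           # broadcasts over the 28x28 spatial grid

print(summed.shape)                    # torch.Size([2, 64, 28, 28])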


The U-Net Architecture: Encoder

As we have successfully implemented the DoubleConv block, the next step to do is to implement the so-called DownSample block. In Figure 6 below, this corresponds to the parts enclosed in the red box.

Figure 6. The parts of the network highlighted in red are the so-called DownSample blocks [3].

The purpose of a DownSample block is to reduce the spatial dimension of an image, but it is important to note that it increases the number of channels at the same time. In order to achieve this, we can simply stack a DoubleConv block and a max pooling operation. In this case the pooling uses a 2×2 kernel with a stride of 2, causing the spatial dimension of the output to be half that of the input. The implementation of this block can be seen in Codeblock 7 below.

# Codeblock 7
class DownSample(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()
        
        self.double_conv = DoubleConv(in_channels=in_channels,  #(2)
                                      out_channels=out_channels)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)    #(3)
    
    def forward(self, x, t):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        
        convolved = self.double_conv(x, t)   #(5)
        print(f'\nafter double conv\t: {convolved.size()}')
        
        maxpooled = self.maxpool(convolved)  #(6)
        print(f'after pooling\t\t: {maxpooled.size()}')
        
        return convolved, maxpooled          #(7)

Here I set the __init__() method to take the number of input and output channels so that we can use it to create the two DownSample blocks highlighted in Figure 6 without writing them as separate classes (#(1)). Next, the DoubleConv and max pooling layers are initialized at lines #(2) and #(3), respectively. Remember that since the DoubleConv block accepts the image x and the corresponding timestep t as inputs, we also need to set the forward() method of this DownSample block to accept both of them as well (#(4)). The information contained in x and t is combined as the two tensors are processed by the double_conv layer, and the output is stored in a variable named convolved (#(5)). Afterwards, we actually perform the downsampling with the max pooling operation at line #(6), producing a tensor named maxpooled. It is important to note that both the convolved and maxpooled tensors are returned: maxpooled will be passed to the next downsampling stage, whereas convolved will be transferred directly to the upsampling stage in the decoder through a skip connection.

Now let’s test the DownSample class using the Codeblock 8 below. The input tensors used here are exactly the same as the ones in Codeblock 6. Based on the resulting output, we can see that the pooling operation successfully converted the output of the DoubleConv block from 2×64×28×28 (#(1)) to 2×64×14×14 (#(2)), indicating that our DownSample class works properly.

# Codeblock 8
down_sample_test = DownSample(in_channels=1, out_channels=64).to(DEVICE)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)

out_test = down_sample_test(x_test, t_test)

# Codeblock 8 Output
original          : torch.Size([2, 1, 28, 28])
timesteps         : torch.Size([2]), tensor([468, 304], device='cuda:0')

after double conv : torch.Size([2, 64, 28, 28])  #(1)
after pooling     : torch.Size([2, 64, 14, 14])  #(2)

The U-Net Architecture: Decoder

We need to introduce the so-called UpSample block in the decoder, which is responsible for reverting the tensors in the intermediate layers back to the original image dimension. In order to maintain a symmetrical structure, the number of UpSample blocks must match the number of DownSample blocks. Look at Figure 7 below to see where the two UpSample blocks are placed.

Figure 7. The components inside the blue boxes are the so-called UpSample blocks [3].

Since both UpSample blocks are structurally identical, we can just initialize a single class for them, just like the DownSample class we created earlier. Look at the Codeblock 9 below to see how I implement it.

# Codeblock 9
class UpSample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        
        self.conv_transpose = nn.ConvTranspose2d(in_channels=in_channels,  #(1)
                                                 out_channels=out_channels, 
                                                 kernel_size=2, stride=2)  #(2)
        self.double_conv = DoubleConv(in_channels=in_channels,  #(3)
                                      out_channels=out_channels)
        
    def forward(self, x, t, connection):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        print(f'connection\t\t: {connection.size()}')
        
        x = self.conv_transpose(x)  #(5)
        print(f'\nafter conv transpose\t: {x.size()}')
        
        x = torch.cat([x, connection], dim=1)  #(6)
        print(f'after concat\t\t: {x.size()}')
        
        x = self.double_conv(x, t)  #(7)
        print(f'after double conv\t: {x.size()}')
        
        return x

In the __init__() method, we use nn.ConvTranspose2d to upsample the spatial dimension (#(1)). Both the kernel size and stride are set to 2 so that the output will be twice as large (#(2)). Next, the DoubleConv block will be employed to reduce the number of channels, while at the same time combining the timestep information from the time embedding tensor (#(3)).

The flow of this UpSample class is a bit more complicated than that of the DownSample class. If we take a closer look at the architecture, we’ll see that we also have a skip connection coming directly from the encoder. Thus, we need the forward() method to accept another argument in addition to the original image x and the timestep t, namely the residual tensor connection (#(4)). The first thing we do inside this method is process the original image x with the transpose convolution layer (#(5)). In fact, this layer not only upsamples the spatial size but also reduces the number of channels at the same time. However, the resulting tensor is then directly concatenated with connection in a channel-wise manner (#(6)), which makes it seem as if no channel reduction was performed. It is important to know that at this point the two tensors are merely concatenated, meaning that the information from the two has not yet been combined. We finally feed these concatenated tensors to the double_conv layer (#(7)), allowing them to share information with each other through the learnable parameters inside the convolution layers.

Codeblock 10 below shows how I test the UpSample class. The sizes of the tensors to be passed through are set according to the second upsampling block, i.e., the rightmost blue box in Figure 7.

# Codeblock 10
up_sample_test = UpSample(in_channels=128, out_channels=64).to(DEVICE)

x_test = torch.randn((BATCH_SIZE, 128, 14, 14)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)
connection_test = torch.randn((BATCH_SIZE, 64, 28, 28)).to(DEVICE)

out_test = up_sample_test(x_test, t_test, connection_test)

In the resulting output below, if we compare the input tensor (#(1)) with the final tensor shape (#(2)), we can clearly see that the number of channels successfully reduced from 128 to 64, while at the same time the spatial dimension increased from 14×14 to 28×28. This essentially means that our UpSample class is now ready to be used in the main U-Net architecture.

# Codeblock 10 Output
original             : torch.Size([2, 128, 14, 14])   #(1)
timesteps            : torch.Size([2]), tensor([468, 304], device='cuda:0')
connection           : torch.Size([2, 64, 28, 28])

after conv transpose : torch.Size([2, 64, 28, 28])
after concat         : torch.Size([2, 128, 28, 28])
after double conv    : torch.Size([2, 64, 28, 28])    #(2)

The U-Net Architecture: Putting All Components Together

Once all U-Net components have been created, what we are going to do next is to wrap them together into a single class. Look at the Codeblock 11a and 11b below for the details.

# Codeblock 11a
class UNet(nn.Module):
    def __init__(self):
        super().__init__()
      
        self.downsample_0 = DownSample(in_channels=NUM_CHANNELS,  #(1)
                                       out_channels=64)
        self.downsample_1 = DownSample(in_channels=64,            #(2)
                                       out_channels=128)
      
        self.bottleneck   = DoubleConv(in_channels=128,           #(3)
                                       out_channels=256)
      
        self.upsample_0   = UpSample(in_channels=256,             #(4)
                                     out_channels=128)
        self.upsample_1   = UpSample(in_channels=128,             #(5)
                                     out_channels=64)
      
        self.output = nn.Conv2d(in_channels=64,                   #(6)
                                out_channels=NUM_CHANNELS,
                                kernel_size=1)

You can see in the __init__() method above that we initialize two downsampling (#(1–2)) and two upsampling (#(4–5)) blocks, where the number of input and output channels are set according to the architecture shown in the illustration. There are actually two additional components I haven’t explained yet, namely the bottleneck (#(3)) and the output layer (#(6)). The former is essentially just a DoubleConv block which acts as the main connection between the encoder and the decoder. Look at Figure 8 below to see which components of the network belong to the bottleneck layer. Next, the output layer is a standard convolution layer which is responsible for turning the 64-channel image produced by the last upsampling stage into a 1-channel image. This operation is done using a kernel of size 1×1, meaning that it combines information across all channels while operating independently at each pixel position.

Figure 8. The bottleneck layer (the lower part of the model) acts as the main bridge between the encoder and the decoder of U-Net [3].

I guess the forward() method of the entire U-Net in the following codeblock is pretty straightforward, as what we essentially do here is pass the tensors from one layer to another — just don’t forget to include the skip connections between the downsampling and upsampling blocks.

# Codeblock 11b
    def forward(self, x, t):  #(1)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
            
        convolved_0, maxpooled_0 = self.downsample_0(x, t)
        print(f'\nmaxpooled_0\t\t: {maxpooled_0.size()}')
            
        convolved_1, maxpooled_1 = self.downsample_1(maxpooled_0, t)
        print(f'maxpooled_1\t\t: {maxpooled_1.size()}')
            
        x = self.bottleneck(maxpooled_1, t)
        print(f'after bottleneck\t: {x.size()}')
    
        upsampled_0 = self.upsample_0(x, t, convolved_1)
        print(f'upsampled_0\t\t: {upsampled_0.size()}')
            
        upsampled_1 = self.upsample_1(upsampled_0, t, convolved_0)
        print(f'upsampled_1\t\t: {upsampled_1.size()}')
            
        x = self.output(upsampled_1)
        print(f'final output\t\t: {x.size()}')
            
        return x

Now let’s see whether we have correctly constructed the U-Net class above by running the following testing code.

# Codeblock 12
unet_test = UNet().to(DEVICE)

x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)

out_test = unet_test(x_test, t_test)

# Codeblock 12 Output
original         : torch.Size([2, 1, 28, 28])   #(1)
timesteps        : torch.Size([2]), tensor([468, 304], device='cuda:0')

maxpooled_0      : torch.Size([2, 64, 14, 14])  #(2)
maxpooled_1      : torch.Size([2, 128, 7, 7])   #(3)
after bottleneck : torch.Size([2, 256, 7, 7])   #(4)
upsampled_0      : torch.Size([2, 128, 14, 14])
upsampled_1      : torch.Size([2, 64, 28, 28])
final output     : torch.Size([2, 1, 28, 28])   #(5)

We can see in the above output that the two downsampling stages successfully converted the original tensor of size 1×28×28 (#(1)) into 64×14×14 (#(2)) and 128×7×7 (#(3)), respectively. This tensor is then passed through the bottleneck layer, causing its number of channels to expand to 256 without changing the spatial dimension (#(4)). Lastly, we upsample the tensor twice before eventually shrinking the number of channels to 1 (#(5)). Based on this output, it looks like our model is working properly. Thus, it is now ready to be trained for our diffusion task.


Dataset Preparation

As we have successfully created the entire U-Net architecture, the next thing to do is prepare the MNIST Handwritten Digit dataset. Before actually loading it, we need to define the preprocessing steps using the transforms.Compose() method from Torchvision, as shown at line #(1) in Codeblock 13. There are two things we do here: converting the images into PyTorch tensors, which also scales the pixel values from 0–255 to 0–1 (#(2)), and normalizing them so that the final pixel values range between -1 and 1 (#(3)). Next, we download the dataset using datasets.MNIST() (#(4)). In this case, we are going to take the images from the training split, hence train=True (#(5)). Don’t forget to pass the transform variable we initialized earlier to the transform parameter (transform=transform) so that the images are automatically preprocessed as we load them (#(6)). Lastly, we employ DataLoader to load the images from mnist_dataset (#(7)). The arguments I use for the input parameters are intended to randomly pick BATCH_SIZE (2) images from the dataset in each iteration.

# Codeblock 13
transform = transforms.Compose([  #(1)
    transforms.ToTensor(),        #(2)
    transforms.Normalize((0.5,), (0.5,))  #(3)
])

mnist_dataset = datasets.MNIST(   #(4)
    root='./data', 
    train=True,           #(5)
    download=True, 
    transform=transform   #(6)
)

loader = DataLoader(mnist_dataset,  #(7)
                    batch_size=BATCH_SIZE,
                    drop_last=True, 
                    shuffle=True)

In the following codeblock, I try to load a batch of images from the dataset. In every iteration, loader provides both the images and the corresponding labels, hence we need to store them in two separate variables: images and labels.

# Codeblock 14
images, labels = next(iter(loader))

print('images\t\t:', images.shape)
print('labels\t\t:', labels.shape)
print('min value\t:', images.min())
print('max value\t:', images.max())

We can see in the resulting output below that the images tensor has the size of 2×1×28×28 (#(1)), indicating that two grayscale images of size 28×28 have been successfully loaded. Here we can also see that the length of the labels tensor is 2, which matches the number of loaded images (#(2)). Note that in this case the labels are going to be completely ignored. My plan here is that I just want the model to generate any digit it has previously seen in the training dataset, without even knowing what number it actually is. Lastly, this output also shows that the preprocessing works properly, as the pixel values now range between -1 and 1.

# Codeblock 14 Output
images    : torch.Size([2, 1, 28, 28])  #(1)
labels    : torch.Size([2])             #(2)
min value : tensor(-1.)
max value : tensor(1.)

Run the following code if you want to see what the image we just loaded looks like.

# Codeblock 15   
plt.imshow(images[0].squeeze(), cmap='gray')
plt.show()
Figure 9. Output from Codeblock 15 [3].

Noise Scheduler

In this section we are going to talk about how the forward and backward diffusion are performed, a process which essentially involves adding or removing noise little by little at each timestep. We basically want the amount of noise to build up in a controlled way across the timesteps: in forward diffusion the image should be completely full of noise by the final timestep, while in backward diffusion we should get a completely clear image back at timestep 0. Hence, we need something to control the amount of noise at each timestep. Later in this section, I am going to implement a class named NoiseScheduler to do so. This will probably be the most mathy section of this article, as I’ll display many equations here. But don’t worry, since we’ll focus on implementing these equations rather than discussing their mathematical derivations.

Now let’s take a look at the equations in Figure 10 which I will implement in the __init__() method of the NoiseScheduler class below.

Figure 10. The equations we need to implement in the __init__() method of the NoiseScheduler class [3].
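In text form, using standard DDPM notation, the four quantities that Codeblock 16a below derives from the linear β schedule are:

\[ \alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \sqrt{\bar{\alpha}_t}, \qquad \sqrt{1 - \bar{\alpha}_t} \]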
# Codeblock 16a
class NoiseScheduler:
    def __init__(self):
        self.betas = torch.linspace(BETA_START, BETA_END, NUM_TIMESTEPS)  #(1)
        self.alphas = 1. - self.betas
        self.alphas_cum_prod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cum_prod = torch.sqrt(self.alphas_cum_prod)
        self.sqrt_one_minus_alphas_cum_prod = torch.sqrt(1. - self.alphas_cum_prod)

The above code works by creating several sequences of numbers, all of which are controlled by BETA_START (0.0001), BETA_END (0.02), and NUM_TIMESTEPS (1000). The first sequence we need to instantiate is betas itself, which is done using torch.linspace() (#(1)). What it essentially does is generate a 1-dimensional tensor of length 1000 going from 0.0001 to 0.02, where every single element corresponds to a single timestep. The interval between consecutive elements is uniform, which gives us a linear noise schedule across the timesteps. With this betas tensor, we then compute alphas, alphas_cum_prod, sqrt_alphas_cum_prod and sqrt_one_minus_alphas_cum_prod based on the four equations in Figure 10. Later on, these tensors will act as the basis of how noise is added or removed during the diffusion process.

Diffusion is normally done in a sequential manner. However, thanks to the properties of Gaussian noise, the forward diffusion equation can be derived into a closed form, so we can obtain the noisy image at a specific timestep without having to iteratively add noise from the very beginning. Figure 11 below shows what the closed form of the forward diffusion looks like, where x₀ represents the original image while epsilon (ϵ) denotes an image made up of random Gaussian noise. We can think of this equation as a weighted combination, where we combine the clear image and the noise according to weights determined by the timestep, resulting in an image with a specific amount of noise.

Figure 11. The closed form of the forward diffusion process [3].
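Written out, the closed-form equation implemented in the next codeblock is:

\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \]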

The implementation of this equation can be seen in Codeblock 16b. In this forward_diffusion() method, x₀ and ϵ are denoted as original and noise. Here you need to keep in mind that these two input variables are images, whereas sqrt_alphas_cum_prod_t and sqrt_one_minus_alphas_cum_prod_t are scalars. Thus, we need to adjust the shape of these two scalars (#(1) and #(2)) so that the operation at line #(3) can be performed. The noisy_image variable is the output of this function, and I guess the name is self-explanatory.

# Codeblock 16b
    def forward_diffusion(self, original, noise, t):
        sqrt_alphas_cum_prod_t = self.sqrt_alphas_cum_prod[t]
        sqrt_alphas_cum_prod_t = sqrt_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1)  #(1)
        
        sqrt_one_minus_alphas_cum_prod_t = self.sqrt_one_minus_alphas_cum_prod[t]
        sqrt_one_minus_alphas_cum_prod_t = sqrt_one_minus_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1)  #(2)
        
        noisy_image = sqrt_alphas_cum_prod_t * original + sqrt_one_minus_alphas_cum_prod_t * noise  #(3)
        
        return noisy_image

Now let’s talk about backward diffusion. In fact, this one is a bit more complicated than the forward diffusion since we need three more equations here. Before I give you these equations, let me show you the implementation first. See the Codeblock 16c below.

# Codeblock 16c
    def backward_diffusion(self, current_image, predicted_noise, t):  #(1)
        denoised_image = (current_image - (self.sqrt_one_minus_alphas_cum_prod[t] * predicted_noise)) / self.sqrt_alphas_cum_prod[t]  #(2)
        denoised_image = 2 * (denoised_image - denoised_image.min()) / (denoised_image.max() - denoised_image.min()) - 1  #(3)
        
        current_prediction = current_image - ((self.betas[t] * predicted_noise) / (self.sqrt_one_minus_alphas_cum_prod[t]))  #(4)
        current_prediction = current_prediction / torch.sqrt(self.alphas[t])  #(5)
        
        if t == 0:  #(6)
            return current_prediction, denoised_image
        
        else:
            variance = (1 - self.alphas_cum_prod[t-1]) / (1. - self.alphas_cum_prod[t])  #(7)
            variance = variance * self.betas[t]  #(8)
            sigma = variance ** 0.5
            z = torch.randn(current_image.shape).to(DEVICE)
            current_prediction = current_prediction + sigma*z
            
            return current_prediction, denoised_image

Later in the inference phase, the backward_diffusion() method will be called inside a loop that iterates NUM_TIMESTEPS (1000) times, starting from t = 999, continuing with t = 998, and so on all the way down to t = 0. This function is responsible for removing noise from the image iteratively based on current_image (the image produced by the previous denoising step), predicted_noise (the noise predicted by U-Net in the previous step), and the timestep information t (#(1)). In each iteration, noise removal is done using the equation shown in Figure 12, which corresponds to lines #(4–5) in Codeblock 16c.

Figure 12. The equation used for removing noise from the image [3].
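In text form, the noise removal step implemented at lines #(4–5) is:

\[ x_{t-1} \approx \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \]

where ϵ_θ(xₜ, t) is the noise predicted by the U-Net.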

As long as we haven’t reached t = 0, we compute the variance based on the equation in Figure 13 (#(7–8)). This variance is then used to introduce another controlled noise to simulate the stochasticity of the backward diffusion process, since the noise removal equation in Figure 12 is a deterministic approximation. This is also the reason we don’t calculate the variance once we reach t = 0 (#(6)), since we no longer need to add noise as the image is already completely clear.

Figure 13. The equation used to calculate variance for introducing controlled noise [3].
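Written out, the variance computed at lines #(7–8) is:

\[ \sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t \]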

Different from current_prediction, which aims to estimate the image of the previous timestep (xₜ₋₁), the objective of the denoised_image tensor is to reconstruct the original image (x₀). Because of these different objectives, we need a separate equation to compute denoised_image, which can be seen in Figure 14 below. The implementation of the equation itself is written at lines #(2–3).

Figure 14. The equation for reconstructing the original image [3].
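In text form, the reconstruction performed at line #(2) (before the min-max rescaling at line #(3)) is:

\[ x_0 \approx \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \]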

Now let’s test the NoiseScheduler class we created above. In the following codeblock, I instantiate a NoiseScheduler object and print out its attributes, which are all computed using the equations in Figure 10 based on the values stored in the betas attribute. Remember that the actual length of these tensors is NUM_TIMESTEPS (1000), but here I only print out the first 6 elements.

# Codeblock 17
noise_scheduler = NoiseScheduler()

print(f'betas\t\t\t\t: {noise_scheduler.betas[:6]}')
print(f'alphas\t\t\t\t: {noise_scheduler.alphas[:6]}')
print(f'alphas_cum_prod\t\t\t: {noise_scheduler.alphas_cum_prod[:6]}')
print(f'sqrt_alphas_cum_prod\t\t: {noise_scheduler.sqrt_alphas_cum_prod[:6]}')
print(f'sqrt_one_minus_alphas_cum_prod\t: {noise_scheduler.sqrt_one_minus_alphas_cum_prod[:6]}')

# Codeblock 17 Output
betas                          : tensor([1.0000e-04, 1.1992e-04, 1.3984e-04, 1.5976e-04, 1.7968e-04, 1.9960e-04])
alphas                         : tensor([0.9999, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998])
alphas_cum_prod                : tensor([0.9999, 0.9998, 0.9996, 0.9995, 0.9993, 0.9991])
sqrt_alphas_cum_prod           : tensor([0.9999, 0.9999, 0.9998, 0.9997, 0.9997, 0.9996])
sqrt_one_minus_alphas_cum_prod : tensor([0.0100, 0.0148, 0.0190, 0.0228, 0.0264, 0.0300])

The above output indicates that our __init__() method works as expected. Next, we are going to test the forward_diffusion() method. If you go back to Codeblock 16b, you will see that forward_diffusion() accepts three inputs: the original image, the noise image, and the timestep number. Let’s just use the image from the MNIST dataset we loaded earlier for the first input (#(1)) and a random Gaussian noise of the exact same size for the second one (#(2)). Run Codeblock 18 below to see what these two images look like.

# Codeblock 18
image = images[0]  #(1)
noise = torch.randn_like(image)  #(2)

plt.imshow(image.squeeze(), cmap='gray')
plt.show()
plt.imshow(noise.squeeze(), cmap='gray')
plt.show()
Figure 15. The two images to be used as the original (left) and the noise image (right). The one on the left is the same image I showed earlier in Figure 9 [3].

Now that we have the image and the noise ready, what we need to do next is pass them to the forward_diffusion() method alongside the t. I actually tried running Codeblock 19 below multiple times with t = 50, 100, 150, and so on up to t = 300. You can see in Figure 16 that the image becomes less clear as the parameter increases. In this case, the image is going to be completely filled with noise when t is set to 999.

# Codeblock 19
noisy_image_test = noise_scheduler.forward_diffusion(image.to(DEVICE), noise.to(DEVICE), t=50)

plt.imshow(noisy_image_test[0].squeeze().cpu(), cmap='gray')
plt.show()
Figure 16. The result of the forward diffusion process at t=50, 100, 150, and so on until t=300 [3].

Unfortunately, we cannot test the backward_diffusion() method yet since this process requires a trained U-Net model. So, let’s just skip this part for now. I’ll show you how to actually use this function later in the inference phase.


Training

As the U-Net model, MNIST dataset, and the noise scheduler are ready, we can now prepare a function for training. Before we do that, I instantiate the model and the noise scheduler in Codeblock 20 below.

# Codeblock 20
model = UNet().to(DEVICE)
noise_scheduler = NoiseScheduler()

The entire training procedure is implemented in the train() function shown in Codeblock 21. Before doing anything else, we first initialize the optimizer and the loss function, which in this case are Adam and MSE, respectively (#(1–2)). What we basically want to do here is train the model so that it is able to predict the noise contained in the input image; later on, the predicted noise will be used as the basis of the denoising process in the backward diffusion stage. To actually train the model, we first need to perform forward diffusion using the code at line #(6). This noising process is applied to the images tensor (#(3)) using the random noise generated at line #(4). Next, we take a random number between 0 and NUM_TIMESTEPS (1000) for t (#(5)), which is done because we want our model to see images of varying noise levels as a way to improve generalization. Once the noisy images have been generated, we pass them through the U-Net model alongside the chosen t (#(7)). The input t is useful for the model as it indicates the current noise level in the image. Lastly, the loss function we initialized earlier computes the difference between the actual noise generated at line #(4) and the noise predicted by the model (#(8)). So, the objective of this training is basically to make the predicted noise as similar as possible to the noise we added to the image.

# Codeblock 21
def train():
    optimizer = Adam(model.parameters(), lr=LEARNING_RATE)  #(1)
    loss_function = nn.MSELoss()  #(2)
    losses = []
    
    for epoch in range(NUM_EPOCHS):
        print(f'Epoch no {epoch}')
        
        for images, _ in tqdm(loader):
            
            optimizer.zero_grad()

            images = images.float().to(DEVICE)  #(3)
            noise = torch.randn_like(images)  #(4)
            t = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,))  #(5)

            noisy_images = noise_scheduler.forward_diffusion(images, noise, t).to(DEVICE)  #(6)
            predicted_noise = model(noisy_images, t)  #(7)
            loss = loss_function(predicted_noise, noise)  #(8)
            
            losses.append(loss.item())
            loss.backward()
            optimizer.step()

    return losses

Now let’s run the above training function using the codeblock below. Sit back and relax while waiting for the training to complete. In my case, I used a Kaggle Notebook with an Nvidia P100 GPU turned on, and it took around 45 minutes to finish.

# Codeblock 22
losses = train()

If we take a look at the loss graph, it seems like our model learned pretty well, as the value generally decreases over time with a rapid drop at the early stages and a more stable (yet still decreasing) trend in the later stages. So, I think we can expect good results in the inference phase.

# Codeblock 23
plt.plot(losses)
Figure 17. How the loss value decreases as the training goes [3].

Inference

At this point we have already got our model trained, so we can now perform inference on it. Look at the Codeblock 24 below to see how I implement the inference() function.

# Codeblock 24
def inference():

    denoised_images = []  #(1)
    
    with torch.no_grad():  #(2)
        current_prediction = torch.randn((64, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)  #(3)
        
        for i in tqdm(reversed(range(NUM_TIMESTEPS))):  #(4)
            predicted_noise = model(current_prediction, torch.as_tensor(i).unsqueeze(0))  #(5)
            current_prediction, denoised_image = noise_scheduler.backward_diffusion(current_prediction, predicted_noise, torch.as_tensor(i))  #(6)

            if i%100 == 0:  #(7)
                denoised_images.append(denoised_image)
            
        return denoised_images

At the line marked with #(1) I initialize an empty list which will be used to store the denoising result every 100 timesteps (#(7)). This will later allow us to see how the backward diffusion progresses. The actual inference process is encapsulated inside torch.no_grad() (#(2)). Remember that in diffusion models we generate images from completely random noise, which we assume to be at t = 999. To implement this, we can simply use torch.randn() as shown at line #(3). Here we initialize a tensor of size 64×1×28×28, indicating that we are about to generate 64 images simultaneously. Next, we write a for loop that iterates backwards from 999 to 0 (#(4)). Inside this loop, we feed the current image and the timestep as inputs to the trained U-Net and let it predict the noise (#(5)). The actual backward diffusion is then performed at line #(6). At the end of the iterations, we should get new images similar to the ones in our dataset. Now let’s call the inference() function in the following codeblock.

# Codeblock 25
denoised_images = inference()

As the inference completed, we can now see what the resulting images look like. The Codeblock 26 below is used to display the first 42 images we just generated.

# Codeblock 26
fig, axes = plt.subplots(ncols=7, nrows=6, figsize=(10, 8))

counter = 0

for i in range(6):
    for j in range(7):
        axes[i,j].imshow(denoised_images[-1][counter].squeeze().detach().cpu().numpy(), cmap='gray')  #(1)
        axes[i,j].get_xaxis().set_visible(False)
        axes[i,j].get_yaxis().set_visible(False)
        counter += 1

plt.show()
Figure 18. The images generated by the diffusion model trained on the MNIST Handwritten Digit dataset [3].

If we take a look at the above codeblock, you can see that the [-1] indexer at line #(1) indicates that we only display the images from the last iteration (which corresponds to timestep 0). This is the reason the images you see in Figure 18 are all free from noise. I do acknowledge that this might not be the best result, since not all of the generated images are valid digits. But hey, this also indicates that these images are not merely duplicates of the original dataset.

Here we can also visualize the backward diffusion process using Codeblock 27 below. You can see in the resulting output in Figure 19 that we initially start from complete random noise, which gradually disappears as we move to the right.

# Codeblock 27
fig, axes = plt.subplots(ncols=10, figsize=(24, 8))

sample_no = 0
timestep_no = 0

for i in range(10):
    axes[i].imshow(denoised_images[timestep_no][sample_no].squeeze().detach().cpu().numpy(), cmap='gray')
    axes[i].get_xaxis().set_visible(False)
    axes[i].get_yaxis().set_visible(False)
    timestep_no += 1

plt.show()
Figure 19. What the image looks like at timestep 900, 800, 700 and so on until timestep 0 [3].

Ending

There are plenty of directions you can go from here. First, you will probably need to tweak the parameter configuration in Codeblock 2 if you want better results. Second, it is also possible to modify the U-Net model by implementing attention layers in addition to the stack of convolution layers we used in the downsampling and upsampling stages. This does not guarantee better results, especially for a simple dataset like this, but it’s definitely worth trying. Third, you can also try to use a more complex dataset if you want to challenge yourself.

When it comes to practical applications, there are actually lots of things you can do with diffusion models. The simplest one might be data augmentation. With a diffusion model, we can easily generate new images from a specific data distribution. For example, suppose we are working on an image classification project, but the number of images in the classes is imbalanced. To address this problem, we can take the images from the minority class and feed them into a diffusion model. By doing so, we can ask the trained diffusion model to generate as many samples from that class as we want.

And well, that’s pretty much everything about the theory and the implementation of diffusion model. Thanks for reading, I hope you learn something new today!

You can access the code used in this project through this link. Here are also the links to my previous articles about Autoencoder, Variational Autoencoder (VAE), Neural Style Transfer (NST), and Transformer.


References

[1] Jascha Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Arxiv. https://arxiv.org/pdf/1503.03585 [Accessed December 27, 2024].

[2] Jonathan Ho et al. Denoising Diffusion Probabilistic Models. Arxiv. https://arxiv.org/pdf/2006.11239 [Accessed December 27, 2024].

[3] Image created originally by author.

[4] Olaf Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. Arxiv. https://arxiv.org/pdf/1505.04597 [Accessed December 27, 2024].

[5] Yann LeCun et al. The MNIST Database of Handwritten Digits. https://yann.lecun.com/exdb/mnist/ [Accessed December 30, 2024] (Creative Commons Attribution-Share Alike 3.0 license).

[6] Ashish Vaswani et al. Attention Is All You Need. Arxiv. https://arxiv.org/pdf/1706.03762 [Accessed September 29, 2024].

AI in Social Research and Polling https://towardsdatascience.com/ai-in-social-research-and-polling/ Wed, 02 Apr 2025 00:42:53 +0000 https://towardsdatascience.com/?p=605381 What do we do when nobody answers the phone?

This month, I’m going to be discussing a really interesting topic that I came across in a recent draft paper by a professor at the University of Maryland named M. R. Sauter. In the paper, they discuss (among other things) the phenomenon of social scientists and pollsters trying to employ AI tools to help overcome some of the challenges in conducting social science human subjects research and polling, and point out some major flaws with these approaches. I had some additional thoughts that were inspired by the topic, so let’s talk about it!

Hi, can I ask you a short series of questions?

Let’s start with a quick discussion of why this would be necessary in the first place. Doing social science research and polling is extraordinarily difficult in the modern day. A huge part of this is simply due to the changes in how people connect and communicate — namely, cellphones — making it incredibly hard to get access to a random sampling of individuals who will participate in your research.

To contextualize this, when I was an undergraduate sociology student almost 25 years ago, in research methods class we were taught that a good way to randomly sample people for large research studies was to just take the area code and 3 digit phone number prefix for an area, and randomly generate selections of four digits to complete them, and call those numbers. In those days, before phone scammers became the bane of all our lives, people would answer and you could ask your research questions. Today, on the other hand, this kind of method for trying to get a representative sample of the public is almost laughable. Scarcely anyone answers calls from unknown numbers in their daily lives, outside of very special situations (like when you’re waiting for a job interviewer to call).

So, what do researchers do now? Today, you can sometimes pay gig workers for poll participation, although Amazon MTurk workers or Upworkers are not necessarily representative of the entire population. The sample you can get will have some bias, which has to be accounted for with sampling and statistical methods. A bigger barrier is that these people’s time and effort costs money, which pollsters and academics are loath to part with (and in the case of academics, increasingly they do not have).

What else? If you’re like me, you’ve probably gotten an unsolicited polling text before as well — these are interesting, because they could be legitimate, or they could be scammers out to get your data or money, and it’s tremendously difficult to tell the difference. As a sociologist, I have a soft spot for doing polls and answering surveys to help other social scientists, and I don’t even trust these to click through, more often than not. They’re also a demand on your time, and many people are too busy even if they trust the source.

The entire industry of polling depends on being able to get a diverse sample of people from all walks of life on the telephone, and convincing them to give you their opinion about something.

Regardless of the attempted solutions and their flaws, this problem matters. The entire industry of polling depends on being able to get a diverse sample of people from all walks of life on the telephone, and convincing them to give you their opinion about something. This is more than just a problem for social scientists doing academic work, because polling is a massive industry unto itself with a lot of money on the line.

Do we really need the humans?

Can AI help with this problem in some way? If we involve generative AI in this task, what might that look like? Before we get to practical ways to attack this, I want to discuss a concept Sauter proposes called “AI imaginaries” — essentially, the narratives and social beliefs we hold about what AI really is, what it can do, and what value it can create. This is hard to pin down, partly because of a “strategic vagueness” about the whole idea of AI. Longtime readers will know I have struggled mightily with figuring out whether and how to even reference the term “AI” because it is such an overloaded and conflicted term.

However, we can all think of potentially problematic beliefs and expectations about AI that we encounter implicitly or explicitly in society, such as the idea that AI is inherently a channel for social progress, or that using AI instead of employing human people for tasks is inherently good, because of “efficiency”. I’ve talked about many of these concepts in my other columns, because I think challenging the accuracy of our assumptions is important to help us suss out what the true contributions of AI to our world can really be. Flawed assumptions can lead us to buy into undeserved hype or overpromising that the tech industry can be unfortunately prone to.

In the context of applying AI to social science research, some of Sauter’s components of the AI imaginary include:

  • expectations that AI can be relied upon as a source of truth
  • believing that everything meaningful can be measured or quantified, and
  • (perhaps most problematically) asserting that there is some equivalency between the output of human intelligence or creativity and the output of AI models

Flawed assumptions can lead us to buy into undeserved hype or overpromising that the tech industry can be unfortunately prone to.

What have they tried?

With this framework of thinking in mind, let’s look at a few of the specific approaches people have taken to fixing the difficulties in finding real human beings to involve in research using AI. Many of these techniques have a common thread in that they give up on trying to actually get access to individuals for the research, and instead just ask LLMs to answer the questions.

In one case, an AI startup offers to use LLMs to run your polling for you, instead of actually asking any people at all. They mimic electoral demographics as closely as possible and build samples almost like “digital twin” entities. (Notably, a September 2024 article reported that they were predicting the eventual US general election result incorrectly.)

Sauter cites a number of other research approaches applying similar techniques, including testing whether the LLM would change its answers to opinion questions when exposed to media with particular leanings or opinions (e.g., replicating the effect of news on public opinion), attempting to specifically emulate human subgroups using LLMs in the belief that this can overcome algorithmic bias, and testing whether the poll responses of LLMs are distinguishable from human answers to the lay person.

Does it work?

Some defend these strategies by arguing that their LLMs can be made to produce answers that approximately match the results of real human polling, while simultaneously arguing that human polling is no longer accurate enough to be usable. This raises an obvious question: if human polling is not trustworthy, how is it trustworthy enough to serve as the benchmark standard for the LLMs?

Furthermore, if the LLM’s output today can be made to match what we think we know about human opinions, that doesn’t mean that output will continue to match human beliefs or the opinions of the public in the future. LLMs are constantly being retrained and developed, and the dynamics of public opinions and perspectives are fluid and variable. One validation today, even if successful, doesn’t promise anything about another set of questions, on another topic, at another time or in another context. Assumptions about this future dependability are consequences of the fallacious expectation that LLMs can be trusted and relied upon as sources of truth, when that is not now and never has been the purpose of these models.

We should always take a step back and remember what LLMs are built for, and what their actual objectives are. As Sanders et al. note, “LLMs generate a response predicted to be most acceptable to the user on the basis of a training process such as reinforcement learning with human feedback”. They’re trying to estimate the next word that will be appealing to you, based on the prompt you have provided — we should not start to fall into mythologizing that suggests the LLM is doing anything else.

When an LLM produces an unexpected response, it’s essentially because a certain amount of randomness is built into the model — from time to time, in order to sound more “human” and dynamic, instead of choosing the next word with the highest probability, it’ll choose a different one further down the rankings. This randomness is not based on an underlying belief or opinion, but is just built in to avoid the text sounding robotic or dull. However, when you use an LLM to replicate human opinions, these become outliers that are absorbed into your data. How does this methodology interpret such responses? In real human polling, the outliers may contain useful information about minority perspectives or the fringes of belief — not the majority, but still part of the population. This opens up a lot of questions about how our interpretation of this artificial data can be conducted, and what inferences we can actually draw.

On synthetic data

This topic overlaps with the broader concept of synthetic data in the AI space. As the quantities of unseen organically human generated content available for training LLMs dwindle, studies have attempted to see whether you could bootstrap your way to better models, namely by making an LLM generate new data, then using that to train on. This fails, causing models to collapse, in a form that Jathan Sadowski named “Habsburg AI”.

What this teaches us is that there is more that differentiates the stuff that LLMs produce from organically human generated content than we can necessarily detect. Something is different about the synthetic stuff, even if we can’t totally identify or measure what it is, and we can tell this is the case because the end results are so drastically different. I’ve talked before about the complications and challenges around human detection of synthetic content, and it’s clear that just because humans may not be able to easily and obviously tell the difference, that doesn’t mean there is none.

[J]ust because humans may not be able to easily and obviously tell the difference, that doesn’t mean there is none.

We might also be tempted by the argument that, well, polling is increasingly unreliable and inaccurate, because we have no more easy, free access to the people we want to poll, so this AI mediated version might be the best we can do. If it’s better than the status quo, what’s wrong with trying?

Is it a good idea?

Whether or not it works, is this the right thing to do? This is the question that most users and developers of such technology don’t take much note of. The tech industry broadly is often guilty of this — we ask whether something is effective, for the immediate objective we have in mind, but we may skip over the question of whether we should do it anyway.

I’ve spent a lot of time recently thinking about why these approaches to polling and research worry me. Sauter makes the argument that this is inherently corrosive to social participation, and I’m inclined to agree in general. There’s something troubling about deciding that, because people are difficult or expensive to involve, we should toss them aside and use technological mimicry to replace them. The validity of this depends heavily on what the task is, and what the broader impact on people and society would be. Efficiency is not the unquestionable good that we might sometimes think.

For one thing, people have increasingly begun to learn that our data (including our opinions) has monetary and social value, and it isn’t outrageous for us to want to get a piece of that value. We’ve been giving our opinions away for free for a long time, but I sense that’s evolving. These days retailers regularly offer discounts and deals in exchange for product reviews, and as I noted earlier, MTurkers and other gig workers can rent out their time and get paid for polling and research projects. In the case of commercial polling, where a good deal of the energy for this synthetic polling comes from, substituting LLMs sometimes feels like a method for making an end run around the pesky human beings who don’t want to contribute to someone else’s profits for free.

If we assume that the LLM can generate accurate polls, we are assuming a state of determinism that runs counter to the democratic project.

But setting this aside, there’s a social message behind these efforts that I don’t think we should minimize. Teaching people that their beliefs and opinions are replaceable with technology sets a precedent that can unintentionally spread. If we assume that the LLM can generate accurate polls, we are assuming a state of determinism that runs counter to the democratic project, and expects democratic choices to be predictable. We may think we know what our peers believe, maybe even just by looking at them or reading their profiles, but in the US, at least, we still operate under a voting model that lets that person have a secret ballot to elect their representation. They are at liberty to make their choice based on any reasoning, or none at all. Presuming that we don’t actually have the free will to change our mind in the privacy of the voting booth just feels dangerous. What’s the argument, if we accept the LLMs instead of real polls, that this can’t be spread to the voting process itself?

I haven’t even touched on the issue of trust that keeps people from honestly responding to polls or research surveys, which is an additional sticking point. Instead of going to the source and really interrogating what it is in our social structure that makes people unwilling to honestly state their sincerely held beliefs to peers, we again see the approach of just throwing up our hands and eliminating people from the process altogether.

Sweeping social problems under an LLM rug

It just seems really troubling that we are considering using LLMs to paper over the social problems getting in our way. It feels similar to a different area I’ve written about, the fact that LLM output replicates and mimics the bigotry and harmful content that it finds in training data. Instead of taking a deeper look at ourselves, and questioning why this is in the organically human created content, some people propose censoring and heavily filtering LLM output, as an attempt to hide this part of our real social world.

I guess it comes down to this: I’m not in favor of resorting to LLMs to avoid trying to solve real social problems. I’m not convinced we’ve really tried, in some cases, and in other cases like the polling, I’m deeply concerned that we’re going to create even more social problems by using this strategy. We have a responsibility to look beyond the narrow scope of the issue we care about at this particular moment, and anticipate cascading externalities that may result.


Read more of my work at www.stephaniekirmer.com.


Further Reading

M. R. Sauter, 2025. https://oddletters.com/files/2025/02/Psychotic-Ecologies-working-paper-Jan-2025.pdf

https://www.stephaniekirmer.com/writing/howdoweknowifaiissmokeandmirrors/

https://hdsr.mitpress.mit.edu/pub/dm2hrtx0/release/1

https://www.semafor.com/article/09/20/2024/ai-startup-aaru-uses-chatbots-instead-of-humans-for-political-polls

https://www.stephaniekirmer.com/writing/theculturalimpactofaigeneratedcontentpart1

https://www.cambridge.org/core/journals/political-analysis/article/abs/out-of-one-many-using-language-models-to-simulate-human-samples/035D7C8A55B237942FB6DBAD7CAA4E49

https://www.jathansadowski.com/

https://futurism.com/ai-trained-ai-generated-data-interview

https://www.stephaniekirmer.com/writing/seeingourreflectioninllms

The post AI in Social Research and Polling appeared first on Towards Data Science.

]]>
Graph Neural Networks Part 3: How GraphSAGE Handles Changing Graph Structure https://towardsdatascience.com/graph-neural-networks-part-3-how-graphsage-handles-changing-graph-structure/ Tue, 01 Apr 2025 06:21:17 +0000 https://towardsdatascience.com/?p=605370 And how you can use it for large graphs

The post Graph Neural Networks Part 3: How GraphSAGE Handles Changing Graph Structure appeared first on Towards Data Science.

]]>
In the previous parts of this series, we looked at Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). Both architectures work fine, but they also have some limitations! A big one is that for large graphs, calculating the node representations with GCNs and GATs will become v-e-r-y slow. Another limitation is that if the graph structure changes, GCNs and GATs will not be able to generalize. So if nodes are added to the graph, a GCN or GAT cannot make predictions for them. Luckily, these issues can be solved!

In this post, I will explain Graphsage and how it solves common problems of GCNs and GATs. We will train GraphSAGE and use it for graph predictions to compare performance with GCNs and GATs.

New to GNNs? You can start with post 1 about GCNs (also containing the initial setup for running the code samples), and post 2 about GATs.


Two Key Problems with GCNs and GATs

I shortly touched upon it in the introduction, but let’s dive a bit deeper. What are the problems with the previous GNN models?

Problem 1. They don’t generalize

GCNs and GATs struggle with generalizing to unseen graphs. The graph structure needs to be the same as the training data. This is known as transductive learning, where the model trains and makes predictions on the same fixed graph. It is actually overfitting to specific graph topologies. In reality, graphs will change: Nodes and edges can be added or removed, and this happens often in real world scenarios. We want our GNNs to be capable of learning patterns that generalize to unseen nodes, or to entirely new graphs (this is called inductive learning).

Problem 2. They have scalability issues

Training GCNs and GATs on large-scale graphs is computationally expensive. GCNs require repeated neighbor aggregation, where the neighborhood that needs to be processed grows exponentially with the number of layers, while GATs involve (multi-head) attention mechanisms that scale poorly with an increasing number of nodes.
In big production recommendation systems that have large graphs with millions of users and products, GCNs and GATs are impractical and slow.

Let’s take a look at GraphSAGE to fix these issues.

GraphSAGE (SAmple and aggreGatE)

GraphSAGE makes training much faster and scalable. It does this by sampling only a subset of neighbors. For super large graphs it’s computationally impossible to process all neighbors of a node, as traditional GCNs do (except if you have limitless time, which we all don’t…). Another important step of GraphSAGE is combining the features of the sampled neighbors with an aggregation function.
We will walk through all the steps of GraphSAGE below.

1. Sampling Neighbors

With tabular data, sampling is easy. It’s something you do in every common machine learning project when creating train, test, and validation sets. With graphs, you cannot simply select random nodes: this can result in disconnected graphs, nodes without neighbors, et cetera:

Randomly selecting nodes, but some are disconnected. Image by author.

What you can do with graphs, is selecting a random fixed-size subset of neighbors. For example in a social network, you can sample 3 friends for each user (instead of all friends):

Randomly selecting three rows in the table, all neighbors selected in the GCN, three neighbors selected in GraphSAGE. Image by author.
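As a rough illustration of what this fixed-size sampling amounts to, here is a minimal pure-Python sketch over a small hypothetical toy graph (this is not the PyG implementation we use later):

import random

# toy adjacency list: node id -> list of neighbor ids (hypothetical example graph)
adjacency = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2], 4: [0]}

def sample_neighbors(node, k=3):
    """Return at most k randomly chosen neighbors of the given node."""
    neighbors = adjacency[node]
    if len(neighbors) <= k:
        return list(neighbors)          # fewer than k neighbors: keep them all
    return random.sample(neighbors, k)  # otherwise sample k without replacement

print(sample_neighbors(0, k=3))  # e.g. [2, 4, 1]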

2. Aggregate Information

After the neighbor selection from the previous part, GraphSAGE combines their features into one single representation. There are multiple ways to do this (multiple aggregation functions). The most common types and the ones explained in the paper are mean aggregation, LSTM, and pooling.

With mean aggregation, the average is computed over all sampled neighbors’ features (very simple and often effective). In a formula:
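\[ h_{\mathcal{N}(v)}^{k} = \mathrm{mean}\left(\{\, h_u^{k-1} \mid u \in \mathcal{N}(v) \,\}\right) \]

Here \( \mathcal{N}(v) \) is the sampled set of neighbors of node \( v \), and \( h_u^{k-1} \) is the representation of neighbor \( u \) from the previous layer (notation as in the GraphSAGE paper).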

LSTM aggregation uses an LSTM (a type of neural network) to process neighbor features sequentially. It can capture more complex relationships, and is more powerful than mean aggregation.

The third type, pool aggregation, applies a non-linear function to extract key features (think of max-pooling in a neural network, where you take the maximum of a set of values).
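To make the difference concrete, here is a minimal PyTorch sketch of mean and pooling (max) aggregation over a stack of sampled neighbor features; the LSTM variant would feed the same stack through an LSTM instead. The tensor shapes are illustrative only:

import torch

# hypothetical example: 3 sampled neighbors, each with a 4-dimensional feature vector
neighbor_feats = torch.randn(3, 4)

mean_agg = neighbor_feats.mean(dim=0)        # mean aggregation: element-wise average
pool_agg = neighbor_feats.max(dim=0).values  # pooling aggregation: element-wise max
# (in the paper, pooling is applied after a learned non-linear transform of each neighbor)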

3. Update Node Representation

After sampling and aggregation, the node combines its previous features with the aggregated neighbor features. Nodes will learn from their neighbors but also keep their own identity, just like we saw before with GCNs and GATs. Information can flow across the graph effectively. 

This is the formula for this step:
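\[ h_v^{k} = \sigma\left( W^{k} \cdot \mathrm{CONCAT}\left( h_v^{k-1},\ h_{\mathcal{N}(v)}^{k} \right) \right) \]

with \( \sigma \) a non-linearity such as ReLU and \( W^{k} \) the weight matrix of layer \( k \) (notation as in the GraphSAGE paper).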

The aggregation from step 2 is computed over the sampled neighbors, and the node’s own feature representation is concatenated to it. This vector is multiplied by the weight matrix and passed through a non-linearity (for example ReLU). As a final step, normalization can be applied.
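A minimal PyTorch sketch of this update (the variable names and dimensions are illustrative, not the internals of SAGEConv):

import torch
import torch.nn.functional as F

in_dim, hidden_dim = 4, 8
self_feats = torch.randn(in_dim)             # the node's own representation from the previous layer
neigh_agg = torch.randn(in_dim)              # aggregated neighbor features (e.g. the mean from step 2)
W = torch.nn.Linear(2 * in_dim, hidden_dim)  # weight matrix applied to the concatenation

h_new = F.relu(W(torch.cat([self_feats, neigh_agg])))  # concatenate, transform, apply non-linearity
h_new = F.normalize(h_new, p=2, dim=0)                 # optional L2 normalization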

4. Repeat for Multiple Layers

The first three steps can be repeated multiple times; when this happens, information can flow in from more distant neighbors. In the image below you see a node with three neighbors selected in the first layer (direct neighbors), and two neighbors selected in the second layer (neighbors of neighbors).

Selected node with selected neighbors, three in the first layer, two in the second layer. Interesting to note: the selected node is itself a neighbor of the nodes chosen in the first step, so it can be sampled again when two neighbors are selected in the second step (just a bit harder to visualize). Image by author.

To summarize, the key strengths of GraphSAGE are its scalability (sampling makes it efficient for massive graphs); its flexibility, since it supports inductive learning (it works well for predicting on unseen nodes and graphs); its aggregation, which helps with generalization because it smooths out noisy features; and its multiple layers, which allow the model to learn from far-away nodes.

Cool! And the best thing, GraphSAGE is implemented in PyG, so we can use it easily in PyTorch.

Predicting with GraphSAGE

In the previous posts, we implemented an MLP, GCN, and GAT on the Cora dataset (CC BY-SA). To refresh your mind a bit, Cora is a dataset of scientific publications where you have to predict the subject of each paper, with seven classes in total. This dataset is relatively small, so it might not be the best set for testing GraphSAGE. We will do this anyway, just to be able to compare. Let’s see how well GraphSAGE performs.

Interesting parts of the code I’d like to highlight, related to GraphSAGE:

  • The NeighborLoader that performs the neighbor selection for each layer:
from torch_geometric.loader import NeighborLoader

# 10 neighbors sampled in the first layer, 10 in the second layer
num_neighbors = [10, 10]

# sample data from the train set
train_loader = NeighborLoader(
    data,
    num_neighbors=num_neighbors,
    batch_size=batch_size,
    input_nodes=data.train_mask,
)
  • The aggregation type is implemented in the SAGEConv layer. The default is mean; you can change this to max or lstm:
from torch_geometric.nn import SAGEConv

SAGEConv(in_c, out_c, aggr='mean')
  • Another important difference is that GraphSAGE is trained in mini-batches, while GCN and GAT are trained on the full dataset. This touches the essence of GraphSAGE: its neighbor sampling makes mini-batch training possible, so we don’t need the full graph anymore. GCNs and GATs do need the complete graph for correct feature propagation and calculation of attention scores, which is why we train them on the full graph.
  • The rest of the code is similar to before, except that we have one class where all the different models are instantiated based on the model_type (GCN, GAT, or SAGE). This makes it easy to compare or make small changes.

This is the complete script, we train 100 epochs and repeat the experiment 10 times to calculate average accuracy and standard deviation for each model:

import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, GCNConv, GATConv
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

# dataset_name can be 'Cora', 'CiteSeer', 'PubMed'
dataset_name = 'Cora'
hidden_dim = 64
num_layers = 2
num_neighbors = [10, 10]
batch_size = 128
num_epochs = 100
model_types = ['GCN', 'GAT', 'SAGE']

dataset = Planetoid(root='data', name=dataset_name)
data = dataset[0]
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = data.to(device)

class GNN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers, model_type='SAGE', gat_heads=8):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        self.model_type = model_type
        self.gat_heads = gat_heads

        def get_conv(in_c, out_c, is_final=False):
            if model_type == 'GCN':
                return GCNConv(in_c, out_c)
            elif model_type == 'GAT':
                heads = 1 if is_final else gat_heads
                concat = False if is_final else True
                return GATConv(in_c, out_c, heads=heads, concat=concat)
            else:
                return SAGEConv(in_c, out_c, aggr='mean')

        if model_type == 'GAT':
            self.convs.append(get_conv(in_channels, hidden_channels))
            in_dim = hidden_channels * gat_heads
            for _ in range(num_layers - 2):
                self.convs.append(get_conv(in_dim, hidden_channels))
                in_dim = hidden_channels * gat_heads
            self.convs.append(get_conv(in_dim, out_channels, is_final=True))
        else:
            self.convs.append(get_conv(in_channels, hidden_channels))
            for _ in range(num_layers - 2):
                self.convs.append(get_conv(hidden_channels, hidden_channels))
            self.convs.append(get_conv(hidden_channels, out_channels))

    def forward(self, x, edge_index):
        for conv in self.convs[:-1]:
            x = F.relu(conv(x, edge_index))
        x = self.convs[-1](x, edge_index)
        return x

@torch.no_grad()
def test(model):
    model.eval()
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)
    accs = []
    for mask in [data.train_mask, data.val_mask, data.test_mask]:
        accs.append(int((pred[mask] == data.y[mask]).sum()) / int(mask.sum()))
    return accs

results = {}

for model_type in model_types:
    print(f'Training {model_type}')
    results[model_type] = []

    for i in range(10):
        model = GNN(dataset.num_features, hidden_dim, dataset.num_classes, num_layers, model_type, gat_heads=8).to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

        if model_type == 'SAGE':
            train_loader = NeighborLoader(
                data,
                num_neighbors=num_neighbors,
                batch_size=batch_size,
                input_nodes=data.train_mask,
            )

            def train():
                model.train()
                total_loss = 0
                for batch in train_loader:
                    batch = batch.to(device)
                    optimizer.zero_grad()
                    out = model(batch.x, batch.edge_index)
                    loss = F.cross_entropy(out, batch.y[:out.size(0)])
                    loss.backward()
                    optimizer.step()
                    total_loss += loss.item()
                return total_loss / len(train_loader)

        else:
            def train():
                model.train()
                optimizer.zero_grad()
                out = model(data.x, data.edge_index)
                loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
                loss.backward()
                optimizer.step()
                return loss.item()

        best_val_acc = 0
        best_test_acc = 0
        for epoch in range(1, num_epochs + 1):
            loss = train()
            train_acc, val_acc, test_acc = test(model)
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                best_test_acc = test_acc
            if epoch % 10 == 0:
                print(f'Epoch {epoch:02d} | Loss: {loss:.4f} | Train: {train_acc:.4f} | Val: {val_acc:.4f} | Test: {test_acc:.4f}')

        results[model_type].append([best_val_acc, best_test_acc])

for model_name, model_results in results.items():
    model_results = torch.tensor(model_results)
    print(f'{model_name} Val Accuracy: {model_results[:, 0].mean():.3f} ± {model_results[:, 0].std():.3f}')
    print(f'{model_name} Test Accuracy: {model_results[:, 1].mean():.3f} ± {model_results[:, 1].std():.3f}')

And here are the results:

GCN Val Accuracy: 0.791 ± 0.007
GCN Test Accuracy: 0.806 ± 0.006
GAT Val Accuracy: 0.790 ± 0.007
GAT Test Accuracy: 0.800 ± 0.004
SAGE Val Accuracy: 0.899 ± 0.005
SAGE Test Accuracy: 0.907 ± 0.004

Impressive improvement! Even on this small dataset, GraphSAGE outperforms GAT and GCN easily! I repeated this test for the CiteSeer and PubMed datasets, and GraphSAGE always came out best.

What I’d like to note here is that GCN is still very useful; it’s one of the most effective baselines (if the graph structure allows it). Also, I didn’t do much hyperparameter tuning, but just went with some standard values (like 8 heads for the GAT multi-head attention). In larger, more complex and noisier graphs, the advantages of GraphSAGE become clearer than in this example. We didn’t do any performance testing, because for these small graphs GraphSAGE isn’t faster than GCN.


Conclusion

GraphSAGE brings us very nice improvements and benefits compared to GATs and GCNs. Inductive learning is possible, and GraphSAGE can handle changing graph structures quite well. And although we didn’t test it in this post, neighbor sampling makes it possible to create feature representations for larger graphs with good performance.

Related

Optimizing Connections: Mathematical Optimization within Graphs

Graph Neural Networks Part 1. Graph Convolutional Networks Explained

Graph Neural Networks Part 2. Graph Attention Networks vs. GCNs

The post Graph Neural Networks Part 3: How GraphSAGE Handles Changing Graph Structure appeared first on Towards Data Science.

]]>
My Learning to Be Hired Again After a Year… Part 2 https://towardsdatascience.com/my-learning-to-be-hired-again-after-a-year-part-2/ Mon, 31 Mar 2025 19:24:25 +0000 https://towardsdatascience.com/?p=605343 One year later: what I learned still matters

The post My Learning to Be Hired Again After a Year… Part 2 appeared first on Towards Data Science.

]]>

This is the second part of My learning to being hired again after a year… Part I.

Hard to believe, but it’s been a full year since I published the first part on TDS. And in that time, something beautiful happened. Every so often, someone would leave a comment, highlight a sentence, or send me a message. Most were simple notes like, “Thank you, Amy. Your post helped me.” But those words lit me up. They brightened entire days. They reminded me that I was never truly alone, not during those long months of unemployment, not in the struggle of figuring out who I was without a job title or company name beneath my email signature or Linkedin profile.

Funny enough, those hard days turned out to be some of the most meaningful ones I’ve had. Maybe even more meaningful than my busiest days at work. Because in losing an identity, I found new ones. I didn’t need a job or a title to feel connected. To many of you, I’m just a pretty lazy writer getting back into the groove. And here I am — returning to my writing routine. So, thank you to everyone who reached out. Your messages rank second on my list of happiest things people give me. The first? That’s easy. My daughter Ellie’s three S’s: her smell, her smile, and her surprises.

Enough talk. Let’s get into Part 2. I’ll pick up where I left off — sharing the lessons that helped me get hired again. This time, I’ll also reflect on how those lessons show up in my work and life today. And for those of you curious about the methods from the book Never Search Alone, I’ve got some thoughts on that too. What worked, what didn’t, and how I made it my own.

Knock, Knock: Opportunity’s at the Door — You Won’t Lose a Penny for Trying

A year into working as a Machine Learning Engineer, I can say this was my biggest life lesson.

Here’s the backstory. I’d been working as a data scientist ever since I finished grad school. Over the past 7 years, I’ve built multiple machine learning models: linear regression, neural networks, and XGBoost. All solid stuff. But when it came to designing an entire machine learning system from start to finish? That was a different story. I hadn’t really done that. I knew how to develop models, sure. I even had some experience deploying them, but only parts of the process. If you asked me to design, build, and run an entire system end-to-end, I couldn’t say I had that experience.

And the job market? It was changing fast. Companies didn’t want someone who could just build models anymore. Generative AI was handling a lot of the data analysis now. What they really wanted was someone who could take machine learning and use it to solve real business problems, someone who could own the whole process. Meanwhile, I had just been laid off. I had time. So I decided maybe this was the right moment to pivot. Maybe it was time to go for machine learning engineering.

The first thing I did was reach out to people who had already made that move. Two friends said yes. One had gone from data scientist to machine learning engineer. The other was a data scientist, and her husband worked as an MLE at Apple. We ended up having this long phone call for two hours, maybe more. They were kind. And they didn’t sugarcoat anything. Both of them told me it was tough to make the switch. Not impossible, but tough. If you didn’t have MLOps experience or a solid GitHub portfolio to show off, landing a senior MLE job would be really hard. Especially with how competitive things were getting.

That conversation hit hard. I remember feeling my heart pound, like cold water had been poured over my head. I had two options: I could keep chasing data scientist jobs — applied scientist roles at places like Amazon — but there weren’t many out there. Or swallow my pride, let go of seven years of experience as a data scientist and go for an entry-level MLE role. Honestly, neither choice felt great.

It took me two weeks to work through it. Two long long weeks. But in the end, I made up my mind: I’d try for machine learning engineer jobs at least, even if I had to start from the bottom. I got back to my routine and prepped for interviews. During those hard days, I started blogging on Medium and published on TDS to show my technical muscle, sharing my “Courage to Learn ML” series. Ready for a spoiler alert? I ended up with three offers for senior and even staff level machine learning engineering roles. And I had three other final-round interviews lined up that I had to walk away from, because there just wasn’t enough time or energy for me to do them all.

No, none of those offers came from FAANG companies. But I’m more than happy with where I landed. It was worth the try.

Even now, writing this, I can still feel that chill from when my friends told me the odds were slim. And I can still laugh at how panicked I was. Just the other day, I spoke with a friend who’s looking to move from data engineering into MLE. I told him the same thing I learned for myself: You can do it. And if you decide it’s worth trying, don’t get hung up on the odds. Even if it’s a 1% chance, why not see if you’re in that 1%? But if you don’t try at all, you’re 100% in the group that never made it.

For me, the takeaway is simple. Don’t be afraid of probabilities. Even 99.999999% is not 100%. If you’re worried about the outcome, stop thinking about the outcome. Just do it for fun, for your mental health, for the chance to live without regrets.

A Year Later: I use this lesson almost every day. I blog shamelessly, pretending I don’t care about if people really read those. I make those awkward customer service calls, just to see if someone on the other end might actually help me. I even buy a lottery ticket now and then when the jackpot tops a billion dollars. Who knows? I might end up in that 0.0000…001%. And you know what? I recently won $12 on a ticket. So yes — it’s worth trying.

Learning During the Struggle: Don’t Beg for Jobs 

This was another hard lesson from my “to be an MLE or not to be” chapter.

When I spoke with those two friends, they made one thing clear. If I wanted to become a machine learning engineer, I needed hands-on experience with MLOps (machine learning operations). The problem? In my past roles, I’d either handed off my models to software engineers for deployment or handled just one small part of the system myself. I knew I had a gap. And my first instinct was to fill it by any means necessary. So I figured, why not get involved in some real projects? Something complex. Something I could proudly add to my resume.

Since I was out of work, I had time. I joined MLOps communities on Slack and Discord. I posted about my background, offered to work for free with any startup or team that needed help, just to get some experience in exchange. The response? Pretty discouraging. Hardly anyone replied. A few did, but they expected me to work 50+ hours a week… for free and without any working plans. I remember sending a message to a PhD student after reading his job posting. I told him how I liked his work and wanted to make his product a reality. He didn’t get back to me. Instead, he changed his posting to say he was seeking experienced MLEs or someone with a PhD. Ouch.

After a few weeks of all that, I was demotivated and burned out. I was pleading for opportunities, and it showed. It was then that I decided to join a Job Search Council (JSC) (I explained JSCs in detail in part 1). We shared the emotional weight of job hunting every Friday. I slowly started letting go of the tension. And that’s when something clicked. I needed to stop pleading for jobs. Instead, I decided to sell what I had.

I rewrote my resume into two versions, one for data scientist roles and the other for MLE roles. I applied for MLE jobs like crazy, just to increase my chances. But this time around, I approached it differently. I broke down what the hiring managers were actually looking for in an MLE. I saw how all the model-building experience I had acquired had actually taught me about debugging, monitoring, and resolving messy business problems. While I didn’t have a lot of MLOps experience, I wasn’t coming from zero. I had a master’s degree in computer science, I was familiar with software development, and I knew data engineering.

In those MLE interviews, I started highlighting those skills. I explained how I applied machine learning to solve business problems, offered subtle hints about my favorite model-training tricks. I showed hiring managers I knew how it felt to run systems into production. I was honest about where I needed to gain more experience. But I made it clear this wasn’t a cold start.

At some point, I stopped acting like a job-beggar and became a salesperson. I wasn’t asking someone to “please hire me. I’m willing to work more and cheaper”. I was selling something. When a company didn’t hire me, it wasn’t a rejection. It just meant they didn’t need someone like me. Maybe I need to tighten the pitch next time.

This mental shift made all the difference. Negative feedback wasn’t personal anymore. It was just feedback, a little data point I could use to make adjustments. When you ask for something, people think less of you. But when you treat yourself as a product, you’re refining and searching for the right buyers. If there’s a flaw, you fix it. If there are good things, you point them out. And sooner or later, you find your people.

A Year Later: I don’t beg anymore. Not for jobs. Not for opportunities. I exchange. I sell. That mindset has become part of me now. It’s my inner tiny salesperson. 

Mock Interviews and the Interview Marathon: Practice Really Does Make a Difference

I’ll be straight with you. Before I started interviewing for machine learning engineer roles after my layoff, I had never really practiced behavioral interviews. Not once in my seven years of working. Sure, I wrote out a few stories using the STAR method, like everyone says you should. But I never practiced them out loud, and I definitely never got feedback. It was like stepping on stage to perform in a play without ever going to rehearsal. I never realized how big a mistake that was, probably because, back when the job market was good, I didn’t have to.

But after the layoff? After spending nearly a year at home because of pregnancy? The market was ice cold. There weren’t many chances, and I couldn’t afford to blow any of them. I had to nail the behavioral interviews. Not just by memorizing my stories, but by actually practicing. For real.

So, I made my husband do mock interviews with me. I sat in one room, he sat in another, and we jumped on Zoom like it was the real thing. Poor guy — he’s been at the same job since forever and works in a totally different field, but there he was, asking me random behavioral questions. At first, I didn’t think it was going to help. I figured he didn’t get what I did anyway. But when I started answering with my “well-crafted” stories, something surprising happened. I got nervous. And wordy. Way too wordy.

And then he cut me off. Not gently, either. He told me straight up: I was spending way too much time talking about the background. The company, the project, all the setup. He said by the time I got to the part about what I actually did, he had already tuned out. You know what? He was 100% correct and I’d never noticed it before. I never thought about how much time I was wasting on details that didn’t really matter to the person listening.

After that, I went back through my stories. Almost all of them had the same problem. Too much setup, not enough focus on action and results. Honestly? I was grateful for his brutal feedback. It was a little embarrassing, but I wished I’d done mock interviews like that years ago.

From then on, I decided to practice a lot more. With my new MLE resume ready, I started applying like crazy. Interviews came in, and instead of trying to avoid them, I leaned in. Earlier in my career, I was the kind of person who’d grab the first offer just to escape the stress of interviewing. Selling myself has always made me a little panicky. After all, I’m an introvert. But this time, things were different. The book Never Search Alone and those early mock interviews changed my mindset. (I’ll talk more about the book and why it prevents me from rushing out of the interview process later.)

So I gave myself time. I said yes to almost every interview I could get. At one point, I interviewed with four companies over three days. It felt like a marathon, but somewhere along the way, I got good at telling my story. I watched how the interviewers reacted. I collected feedback from the process. And something strange happened: I stopped caring so much about the results. Whether I got a yes or a no didn’t shake me anymore. I wasn’t just interviewing to get a job. I was practicing to get the job I really wanted.

By the time I had three offers on the table and finally chose the one I liked, I knew I was done. That was my finish line. It felt like I’d run the full race and actually won the prize I wanted, not the one I settled for.

Seriously, I can’t say this enough: KEEP interviewing. Back-to-back if you can. Do mock interviews with whoever you trust, even if they aren’t in your field. Practice until you’re less worried about the outcome and more focused on getting better.

A Year Later: It’s hard to say how much of those interview skills I still have in me now. But if I ever need to practice again, you better believe I’ll be dragging my husband back into another round of mock interviews. Maybe even for business presentations. He’s a tough crowd, but he gets results :]

Panic Mode? Deep Breath, the Show Must Go On

During my interview marathon, I started noticing something that completely threw me off. Some interviewers looked… disappointed. Others seemed bored. And me? I cared. A lot. Probably too much. Every time I saw a face that wasn’t smiling or nodding, I panicked. In my head, I’d hear this loud voice saying, “Amy, you’re blowing it.” And once that thought crept in, it was over. My brain and body would scramble to fix the situation, so I’d start talking faster, throwing out more words, hoping to change their minds. I wanted to come across as sharp and impressive. But the truth is, I probably looked like a nervous, rambling mess. 

My husband confirmed it after one of our mock interviews. He didn’t sugarcoat it. “You’re not even looking at the camera,” he said. “And you seem really tense.” Again, he was right.

For an introvert like me, fixing this wasn’t easy. But I found two things that helped, so I will share them here.

The first was simple: breathe. Every time I spotted what I thought was a bad reaction, a frown, a yawn, that blank expression that felt like doom, I forced myself to pause. I took a breath. And instead of rushing to say more, I slowed down. Sometimes I even cracked a cold joke. (I’m surprisingly good at bad jokes. It might be my secret talent.) Then I’d apologize for the joke, take another breath, and move on. That little reset worked in two ways. First, it quieted the voice in my head screaming “You’re ruining this!” Secondly, it made the interviewer’s expression change. Maybe they smiled and got the joke. Maybe they just looked confused and didn’t like it. But at least they weren’t bored or disappointed anymore. I’ll take that.

The second thing I did was tape a picture of my daughter right behind the camera. Her big, shiny smile was right there, and every time I glanced at it, I smiled too. Which, by the way, made me look more relaxed and human on camera. Sometimes the interviewer smiled back, and just like that, the energy shifted. I wasn’t panicking anymore. I was back in control. The show was back on.

I started thinking of myself as a salesperson. Or maybe a showman. What do they do when the audience looks tired or distracted? They keep going. They adjust. They bring the energy back. If you’re like me, someone who takes those reactions personally, you need to have a plan. These were my two tricks. You’ll probably find your own. But the point is: don’t panic. Pause. Breathe. No one will notice. And then, get back to the show.

A Year Later: Honestly, this might be the most important skill I picked up during that tough year. I still use it all the time at work. When I’m presenting my work to a room full of people, I slow myself down. I picture myself in a fancy tailcoat, like an old-school showman, selling my ideas to the audience. Sometimes I throw in one of my classic cold jokes to keep things light.

When I wrap up a presentation, I make sure to give people something easy to take with them. I’ll say, “If you’re heading out and want one thing to remember about this project, here’s the punchline.” Then I boil it down to one or two sentences and say it clearly. Loud enough to stick.

I even use this trick in regular conversations, especially the awkward ones. A little pause makes everything less uncomfortable. And more often than not, things turn out better after that moment to reset.

Do the Mnookin Two-Pager exercise: How I Found a Job That Actually Fit Me

I keep mentioning the book Never Search Alone, and there’s a reason for that. When I first heard about it, I was skeptical. As an introvert, the idea of joining a group of strangers to talk about job hunting made me extremely uncertain and nervous. 

My first group didn’t go well. There were five of us, but two people refused to follow the process. They were often late or skipped meetings entirely. It was frustrating, and I almost gave up. Instead, I found another group through the Slack community. That time, it clicked. We met every Friday, and kept each other accountable. We helped one another stay sane through the search. It made a huge difference. If you want to know more about how the JSC (Job Search Council) helped me, I wrote about it in part one of this story.

Looking back, another useful thing the book offered was the Mnookin Two-Pager exercise. You sit down and write out what you love in a job, what you hate, and what your career goals are. Simple, but surprisingly powerful. It forced me to get honest with myself. Without it, I probably would have grabbed the very first offer and rushed out of the market, just to be done with it. I’ve done that before. And regretted it.

This time was different. My two pager list kept me grounded. I knew what I wanted and where I wasn’t willing to settle. That’s how I ended up at Disney. The role hits about 85% of what I was hoping for. More importantly, it steers clear of every red flag on my “hard no” list. A year later, I’m still glad I took the time to figure out exactly what I was looking for before saying yes to anything.


Finally! We Made It to the End. 

I’m so glad I finally sat down and finished this. Honestly, I’m the kind of person who thinks a lot. But writing things out like this helps me clear my head and hold on to the lessons I actually want to keep.

If you’ve enjoyed reading this, and you want to read more stories from me, or you just want to smile at how bad my jokes are, please keep an eye on my posts on TDS. Or better yet, subscribe to my newsletter where I write more frequently about AI and ML, along with life lessons, parenting, and, of course, a few of my cold jokes! If you’d like to support my writing, you can also just buy me a coffee at https://ko-fi.com/amyma101! ☕✨

The post My Learning to Be Hired Again After a Year… Part 2 appeared first on Towards Data Science.

]]>
A Little More Conversation, A Little Less Action — A Case Against Premature Data Integration https://towardsdatascience.com/a-little-more-conversation-a-little-less-action-a-case-against-premature-data-integration/ Sat, 29 Mar 2025 03:00:55 +0000 https://towardsdatascience.com/?p=605333 Running a large data integration project before embarking on the ML part is easily a bad idea, because you integrate data without knowing its use. The chances that the data is going to be fit for purpose in some future ML use case are slim, at best.
This article shows how having people talking together helps avoid the trap of premature data integration in ML projects, optimising value for money.

The post A Little More Conversation, A Little Less Action — A Case Against Premature Data Integration appeared first on Towards Data Science.

]]>
When I talk to [large] organisations that have not yet properly started with Data Science (DS) and Machine Learning (ML), they often tell me that they have to run a data integration project first, because “…all the data is scattered across the organisation, hidden in silos and packed away at odd formats on obscure servers run by different departments.”

While it may be true that the data is hard to get at, running a large data integration project before embarking on the ML part is easily a bad idea. This is because you integrate data without knowing its use — the chances that the data is going to be fit for purpose in some future ML use case are slim, at best.

In this article, I discuss some of the most important drivers and pitfalls for this kind of integration projects, and rather suggest an approach that focuses on optimising value for money in the integration efforts. The short answer to the challenge is [spoiler alert…] to integrate data on a use-case-per-use-case basis, working backwards from the use case to identify exactly the data you need.

A desire for clean and tidy data

It is easy to understand the urge for doing data integration prior to starting on the data science and machine learning challenges. Below, I list four drivers that I often meet. The list is not exhaustive, but covers the most important motivations, as I see it. We will then go through each driver, discussing their merits, pitfalls and alternatives.

  1. Cracking out AI/ML use cases is difficult, and even more so if you don’t know what data is available, and of which quality.
  2. Snooping out hidden-away data and integrating the data into a platform seems like a more concrete and manageable problem to solve.
  3. Many organisations have a culture for not sharing data, and focusing on data sharing and integration first, helps to change this.
  4. From history, we know that many ML projects grind to a halt due to data access issues, and tackling the organisational, political and technical challenges prior to the ML project may help remove these barriers.

There are of course other drivers for data integration projects, such as “single source of truth”, “Customer 360”, FOMO, and the basic urge to “do something now!”. While important drivers for data integration initiatives, I don’t see them as key for ML-projects, and therefore will not discuss these any further in this post.

1. Cracking out AI/ML use cases is difficult,

… and even more so if you don’t know what data is available, and of which quality. This is, in fact, a real Catch-22 problem: you can’t do machine learning without the right data in place, but if you don’t know what data you have, identifying the potentials of machine learning is essentially impossible too. Indeed, it is one of the main challenges in getting started with machine learning in the first place [See “Nobody puts AI in a corner!” for more on that]. But the problem is not solved most effectively by running an initial data discovery and integration project. It is better solved by an awesome methodology, that is well proven in use, and applies to so many different problem areas. It is called talking together. Since this, to a large extent, is the answer to several of the driving urges, we shall spend a few lines on this topic now.

The value of having people talking to each other cannot be overestimated. This is the only way to make a team work, and to make teams across an organisation work together. It is also a very efficient carrier of information about intricate details regarding data, products, services or other contraptions that are made by one team, but to be used by someone else. Compare “Talking Together” to its antithesis in this context: Produce Comprehensive Documentation. Producing self-contained documentation is difficult and expensive. For a dataset to be usable by a third party solely by consulting the documentation, it has to be complete. It must document the full context in which the data must be seen; How was the data captured? What is the generating process? What transformation has been applied to the data in its current form? What is the interpretation of the different fields/columns, and how do they relate? What are the data types and value ranges, and how should one deal with null values? Are there access restrictions or usage restrictions on the data? Privacy concerns? The list goes on and on. And as the dataset changes, the documentation must change too.

Now, if the data is an independent, commercial data product that you provide to customers, comprehensive documentation may be the way to go. If you are OpenWeatherMap, you want your weather data APIs to be well documented — these are true data products, and OpenWeatherMap has built a business out of serving real-time and historical weather data through those APIs. Also, if you are a large organisation and a team finds that it spends so much time talking to people that it would indeed pay off making comprehensive documentation — then you do that. But most internal data products have one or two internal consumers to begin with, and then, comprehensive documentation doesn’t pay off.

On a general note, Talking Together is actually a key factor for succeeding with a transition to AI and Machine Learning altogether, as I write about in “Nobody puts AI in a corner!”. And, it is a cornerstone of agile software development. Remember the Agile Manifesto? It values individuals and interactions over processes and tools, and working software over comprehensive documentation. So there you have it. Talk Together.

Also, not only does documentation incur a cost, but you are running the risk of increasing the barrier for people talking together (“read the $#@!!?% documentation”).

Now, just to be clear on one thing: I am not against documentation. Documentation is super important. But, as we discuss in the next section, don’t waste time on writing documentation that is not needed.

2. Snooping out hidden away data and integrating the data into a platform seems as a much more concrete and manageable problem to solve.

Yes, it is. However, the downside of doing this before identifying the ML use case, is that you only solve the “integrating data in a platform” problem. You don’t solve the “gather useful data for the machine learning use case” problem, which is what you want to do. This is another flip side of the Catch-22 from the previous section: if you don’t know the ML use case, then you don’t know what data you need to integrate. Also, integrating data for its own sake, without the data-users being part of the team, requires very good documentation, which we have already covered.

To look deeper into why data integration without the ML-use case in view is premature, we can look at how [successful] machine learning projects are run. At a high level, the output of a machine learning project is a kind of oracle (the algorithm) that answers questions for you. “What product should we recommend for this user?”, or “When is this motor due for maintenance?”. If we stick with the latter, the algorithm would be a function mapping the motor in question to a date, namely the due date for maintenance. If this service is provided through an API, the input can be {“motor-id” : 42} and the output can be {“latest maintenance” : “March 9th 2026”}. Now, this prediction is done by some “system”, so a richer picture of the solution could be something along the lines of

System drawing of a service doing predictive maintenance forecasts for a motor by estimating a latest maintenance date.
Image by the author.
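As a hedged sketch of what such a service could look like in code (the function names, feature fields, and the toy prediction rule are hypothetical placeholders, not a real model or platform API):

from datetime import date, timedelta

def fetch_motor_features(motor_id: int) -> dict:
    """Hypothetical lookup of the feature vector for one motor from the data platform."""
    # in reality this would query the relevant, already-integrated data sources
    return {"operating_hours": 12400, "avg_temperature": 71.3, "vibration_rms": 0.42}

def predict_latest_maintenance(motor_id: int) -> dict:
    features = fetch_motor_features(motor_id)
    # placeholder for the trained model; a real service would call something like model.predict(features)
    days_until_due = max(0, int(5000 - 0.3 * features["operating_hours"]))
    due_date = date.today() + timedelta(days=days_until_due)
    return {"latest maintenance": due_date.isoformat()}

print(predict_latest_maintenance(42))  # -> a dict with the estimated due date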

The key here is that the motor-id is used to obtain further information about that motor from the data mesh in order to do a robust prediction. The required data set is illustrated by the feature vector in the illustration. And exactly which data you need in order to do that prediction is difficult to know before the ML project is started. Indeed, the very precipice on which every ML project balances, is whether the project succeeds in figuring out exactly what information is required to answer the question well. And this is done by trial and error in the course of the ML project (we call it hypothesis testing and feature extraction and experiments and other fancy things, but it’s just structured trial and error).

If you integrate your motor data into the platform without these experiments, how are you going to know what data you need to integrate? Surely, you could integrate everything, and keep updating the platform with all the data (and documentation) to the end of time. But most likely, only a small amount of that data is required to solve the prediction problem. Unused data is waste. Both the effort invested in integrating and documenting the data, as well as the storage and maintenance cost for all time to come. According to the Pareto rule, you can expect roughly 20% of the data to provide 80% of the data value. But it is hard to know which 20% this is prior to knowing the ML use case, and prior to running the experiments.

This is also a caution against just “storing data for the sake of it”. I’ve seen many data hoarding initiatives, where decrees have been passed from top management about saving away all the data possible, because data is the new oil/gold/cash/currency/etc. For a concrete example: a few years back I met with an old colleague, a product owner in the mechanical industry, and they had started collecting all sorts of time series data about their machinery some time ago. One day, they came up with a killer ML use case where they wanted to take advantage of how distributed events across the industrial plant were related. But, alas, when they looked at their time series data, they realised that the distributed machine instances did not have sufficiently synchronised clocks, leading to non-correlatable time stamps, so the planned cross correlation between time series was not feasible after all. Bummer, that one, but a classical example of what happens when you don’t know the use case you are gathering data for.

3. Many organisations have a culture for not sharing data, and focusing on data sharing and integration first, helps to change this culture.

The first part of this sentence is true; there is no doubt that many good initiatives are blocked due to cultural issues in the organisation. Power struggles, data ownership, reluctance to share, siloing, etc. The question is whether an organisation-wide data integration effort is going to change this. If someone is reluctant to share their data, having a creed from above stating that “if you share your data, the world is going to be a better place” is probably too abstract to change that attitude.

However, if you interact with this group, include them in the work and show them how their data can help the organisation improve, you are much more likely to win their hearts. Because attitudes are about feelings, and the best way to deal with differences of this kind is (believe it or not) to talk together. The team providing the data has a need to shine, too. And if they are not being invited into the project, they will feel forgotten and ignored when honour and glory rains on the ML/product team that delivered some new and fancy solution to a long standing problem.

Remember that the data feeding the ML algorithms is part of the product stack — if you don’t include the data-owning team in the development, you are not running full stack. (An important reason why full stack teams are better than many alternatives is that inside teams, people talk together. Bringing all the players in the value chain into the [full stack] team gets them talking together.)

I have been in a number of organisations, and many times have I run into collaboration problems due to cultural differences of this kind. Never have I seen such barriers drop because of a decree from the C-suite. Middle management may buy into it, but the rank-and-file employees mostly give it a scornful look and carry on as before. However, I have been in many teams where we solved this problem by inviting the other party into the fold and talking about it, together.

4. From history, we know that many DS/ML projects grind to a halt due to data access issues, and tackling the organisational, political and technical challenges prior to the ML project may help remove these barriers.

While the point about cultural change is about human behaviour, this one is about the technical state of affairs. When data is integrated into the platform, it should be safely stored and easy to obtain and use in the right way. For a large organisation, having a strategy and policies for data integration is key. But there is a difference between setting up an infrastructure for data integration, with a minimum of processes around it, and scavenging through the enterprise to integrate a shitload of data. Yes, you need the platform and the policies, but you don’t integrate data before you know that you need it. And when you do this step by step, you can benefit from iterative development of the data platform too.

A basic platform infrastructure should also come with the necessary policies to ensure compliance with regulations, privacy, and other concerns: the concerns that come with being an organisation that uses machine learning and artificial intelligence to make decisions, trained on data that may or may not have been generated by individuals who may or may not have consented to the different uses of that data.

But to circle back to the first driver, about not knowing what data the ML projects may get their hands on: you still need something to help people navigate the data residing in various parts of the organisation. And if we are not to run an integration project first, what do we do? Establish a catalogue where departments and teams are rewarded for adding a block of text about the kinds of data they are sitting on. Just a brief description: what kind of data it is, what it is about, who the stewards are, and perhaps a guess at what it could be used for. Put this into a text database or similar structure, and make it searchable. Or, even better, let the database back an AI assistant that allows proper semantic searches through the descriptions of the datasets. As time (and projects) pass by, the catalogue can be extended with further information and documentation as data is integrated into the platform and documented. And if someone queries a department about their dataset, you may just as well shove both the question and the answer into the catalogue database too.
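To illustrate how lightweight this can be, here is a minimal sketch of such a catalogue. The dataset names, owners, and descriptions are made up, and the naive keyword scoring is just a placeholder for whatever full-text or semantic search you would plug in later.

```python
from dataclasses import dataclass

# A minimal sketch of a free-text data catalogue. All entries are
# hypothetical; the point is that a plain, searchable description of each
# dataset is enough to get data discovery started.

@dataclass
class CatalogueEntry:
    dataset: str
    owner_team: str
    description: str

CATALOGUE = [
    CatalogueEntry(
        dataset="motor_telemetry",
        owner_team="Plant Operations",
        description="Per-minute vibration and temperature readings for all "
                    "electrical motors since 2021. Clocks synced via NTP.",
    ),
    CatalogueEntry(
        dataset="maintenance_logs",
        owner_team="Field Service",
        description="Free-text work orders and spare-part usage per motor id.",
    ),
]

def search(query: str) -> list[CatalogueEntry]:
    """Naive keyword search; swap in full-text or semantic search later."""
    terms = query.lower().split()

    def score(entry: CatalogueEntry) -> int:
        text = f"{entry.dataset} {entry.description}".lower()
        return sum(term in text for term in terms)

    return sorted((e for e in CATALOGUE if score(e) > 0), key=score, reverse=True)

for hit in search("motor vibration"):
    print(hit.dataset, "-", hit.owner_team)
```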

Such a database, containing mostly free text, is a much cheaper alternative to a readily integrated data platform with comprehensive documentation. You just need the different data-owning teams and departments to dump some of their documentation into the database. They may even use generative AI to produce the documentation (allowing them to check off that OKR too 🙉🙈🙊).

5. Summing up

To sum up: in the context of ML projects, data integration efforts should be tackled as follows:

  1. Establish a data platform/data mesh strategy, together with the minimally required infrastructure and policies.
  2. Create a catalogue of dataset descriptions that can be queried by using free text search, as a low-cost data discovery tool. Incentivise the different groups to populate the database through use of KPIs or other mechanisms.
  3. Integrate data into the platform or mesh on a use case per use case basis, working backwards from the use case and ML experiments, making sure the integrated data is both necessary and sufficient for its intended use.
  4. Solve cultural, cross departmental (or silo) barriers by including the relevant resources into the ML project’s full stack team, and…
  5. Talk Together

Good luck!

Regards
-daniel-
