Are You Sure Your Posterior Makes Sense?

A detailed guide on how to use diagnostics to evaluate the performance of MCMC samplers

This article is co-authored by Felipe Bandeira, Giselle Fretta, Thu Than, and Elbion Redenica. We also thank Prof. Carl Scheffler for his support.

Introduction

Parameter estimation has been one of the most important topics in statistics for decades. While frequentist approaches, such as Maximum Likelihood Estimation, used to be the gold standard, the advance of computation has opened space for Bayesian methods. Estimating posterior distributions with MCMC samplers has become increasingly common, but reliable inference depends on a task that is far from trivial: making sure that the sampler — and the processes it executes under the hood — worked as expected. It is worth keeping in mind the line often attributed to Lewis Carroll: “If you don’t know where you’re going, any road will take you there.”

This article is meant to help data scientists evaluate an often overlooked aspect of Bayesian parameter estimation: the reliability of the sampling process. Throughout the sections, we combine simple analogies with technical rigor to ensure our explanations are accessible to data scientists with any level of familiarity with Bayesian methods. Although our implementations are in Python with PyMC, the concepts we cover are useful to anyone using an MCMC algorithm, from Metropolis-Hastings to NUTS. 

Key Concepts

No data scientist or statistician would disagree with the importance of robust parameter estimation methods. Whether the objective is to make inferences or conduct simulations, being able to model the data generation process is crucial. For a long time, estimation was mainly performed using frequentist tools, such as Maximum Likelihood Estimation (MLE) or the famous Least Squares optimization used in regressions. Yet, frequentist methods have clear shortcomings, such as the fact that they focus on point estimates and do not incorporate prior knowledge that could improve estimates.

As an alternative to these tools, Bayesian methods have gained popularity over the past decades. They provide statisticians not only with point estimates of the unknown parameter but also with credible intervals for it, all of which are informed by the data and by the prior knowledge researchers hold. Originally, Bayesian parameter estimation was done through an adapted version of Bayes’ theorem focused on unknown parameters (represented as θ) and known data points (represented as x). We can define P(θ|x), the posterior distribution of a parameter’s value given the data, as:

\[ P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)} \]

In this formula, P(x|θ) is the likelihood of the data given a parameter value, P(θ) is the prior distribution over the parameter, and P(x) is the evidence, which is computed by integrating the joint distribution over all possible parameter values:

\[ P(x) = \int_\theta P(x, \theta) d\theta \]

In some cases, due to the complexity of the calculations required, deriving the posterior distribution analytically was not possible. However, with the advance of computation, running sampling algorithms (especially MCMC ones) to estimate posterior distributions has become easier, giving researchers a powerful tool for situations where analytical posteriors are not trivial to find. Yet, with such power also comes a large amount of responsibility to ensure that results make sense. This is where sampler diagnostics come in, offering a set of valuable tools to gauge 1) whether an MCMC algorithm is working well and, consequently, 2) whether the estimated distribution we see is an accurate representation of the real posterior distribution. But how can we tell?

How samplers work

Before diving into the technicalities of diagnostics, we shall cover how the process of sampling a posterior (especially with an MCMC sampler) works. In simple terms, we can think of a posterior distribution as a geographical area we haven’t been to but need to know the topography of. How can we draw an accurate map of the region?  

One of our favorite analogies comes from Ben Gilbert. Suppose that the unknown region is actually a house whose floorplan we wish to map. For some reason, we cannot directly visit the house, but we can send bees inside with GPS devices attached to them. If everything works as expected, the bees will fly around the house, and using their trajectories, we can estimate what the floor plan looks like. In this analogy, the floor plan is the posterior distribution, and the sampler is the group of bees flying around the house.

The reason we are writing this article is that, in some cases, the bees won’t fly as expected. If they get stuck in a certain room for some reason (because someone dropped sugar on the floor, for example), the data they return won’t be representative of the entire house; rather than visiting all rooms, the bees only visited a few, and our picture of what the house looks like will ultimately be incomplete. Similarly, when a sampler does not work correctly, our estimation of the posterior distribution is also incomplete, and any inference we draw based on it is likely to be wrong.

Markov Chain Monte Carlo (MCMC)

In technical terms, we call an MCMC process any algorithm that undergoes transitions from one state to another with certain properties. Markov Chain refers to the fact that the next state only depends on the current one (or that the bee’s next location is only influenced by its current place, and not by all of the places where it has been before). Monte Carlo means that the next state is chosen randomly. MCMC methods like Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), and No-U-Turn Sampler (NUTS) all operate by constructing Markov Chains (a sequence of steps) that are close to random and gradually explore the posterior distribution.
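To make the mechanics concrete, here is a minimal random-walk Metropolis-Hastings sketch targeting a toy standard Normal posterior. The target density, step size, and chain length are illustrative assumptions, and this is not the NUTS algorithm PyMC uses by default.

import numpy as np

def log_target(theta):
    # Toy (unnormalized) log-posterior: a standard Normal
    return -0.5 * theta**2

def random_walk_metropolis(n_samples=5000, step_size=1.0, seed=42):
    rng = np.random.default_rng(seed)
    chain = np.empty(n_samples)
    current = 0.0  # the bee's starting position
    for i in range(n_samples):
        # "Monte Carlo": propose a random move around the current state
        proposal = current + rng.normal(scale=step_size)
        # "Markov": acceptance depends only on the current and proposed states
        if np.log(rng.uniform()) < log_target(proposal) - log_target(current):
            current = proposal
        chain[i] = current
    return chain

samples = random_walk_metropolis()
print(samples.mean(), samples.std())  # should be roughly 0 and 1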

Now that you understand how a sampler works, let’s dive into a practical scenario to help us explore sampling problems.

Case Study

Imagine that, in a faraway nation, a governor wants to understand more about annual public spending on healthcare by mayors of cities with fewer than 1 million inhabitants. Rather than looking at sheer frequencies, he wants to understand the underlying distribution that explains expenditure, and a sample of spending data is about to arrive. The problem is that two of the economists involved in the project disagree about how the model should look.

Model 1

The first economist believes that all cities spend similarly, with some variation around a certain mean. As such, he creates a simple model. Although the specifics of how the economist chose his priors are irrelevant to us, we do need to keep in mind that he is trying to approximate a Normal (unimodal) distribution.

\[
x_i \sim \text{Normal}(\mu, \sigma^2) \text{ i.i.d. for all } i \\
\mu \sim \text{Normal}(10, 2) \\
\sigma^2 \sim \text{Uniform}(0,5)
\]

Model 2

The second economist disagrees, arguing that spending is more complex than his colleague believes. He believes that, given ideological differences and budget constraints, there are two kinds of cities: the ones that do their best to spend very little and the ones that are not afraid of spending a lot. As such, he creates a slightly more complex model, using a mixture of normals to reflect his belief that the true distribution is bimodal.

\[
x_i \sim \text{Normal-Mixture}([\omega, 1-\omega], [m_1, m_2], [s_1^2, s_2^2]) \text{ i.i.d. for all } i\\
m_j \sim \text{Normal}(2.3, 0.5^2) \text{ for } j = 1,2 \\
s_j^2 \sim \text{Inverse-Gamma}(1,1) \text{ for } j=1,2 \\
\omega \sim \text{Beta}(1,1)
\]

After the data arrives, each economist runs an MCMC algorithm to estimate their desired posteriors, which will be a reflection of reality (1) if their assumptions are true and (2) if the sampler worked correctly. The first if, a discussion about assumptions, shall be left to the economists. However, how can they know whether the second if holds? In other words, how can they be sure that the sampler worked correctly and, as a consequence, their posterior estimations are unbiased?
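Before answering that, it helps to see what the setup could look like in code. Below is one hedged way to write and sample the two models in PyMC; the variable names, the placeholder data, and the sampler settings are our own assumptions for illustration, not the economists’ actual setup.

import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
# Placeholder spending data, purely illustrative (the real sample is not shown here)
x = np.concatenate([rng.normal(1.5, 0.3, 100), rng.normal(2.8, 0.3, 100)])

# Model 1: a single Normal distribution
with pm.Model() as model_1:
    m = pm.Normal("m", mu=10, sigma=2)
    s = pm.Uniform("s", lower=0, upper=5)  # the variance σ²
    pm.Normal("x", mu=m, sigma=pm.math.sqrt(s), observed=x)
    idata_1 = pm.sample(2000, chains=4, random_seed=1)

# Model 2: a two-component mixture of Normals
with pm.Model() as model_2:
    m = pm.Normal("m", mu=2.3, sigma=0.5, shape=2)
    s_squared = pm.InverseGamma("s_squared", alpha=1, beta=1, shape=2)
    w = pm.Beta("w", alpha=1, beta=1)
    weights = pm.math.stack([w, 1 - w])
    pm.NormalMixture("x", w=weights, mu=m, sigma=pm.math.sqrt(s_squared), observed=x)
    idata_2 = pm.sample(2000, chains=4, random_seed=1)

With both InferenceData objects in hand (idata_1 and idata_2), we can run the diagnostics discussed next.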

Sampler Diagnostics

To evaluate a sampler’s performance, we can explore a small set of metrics that reflect different parts of the estimation process.

Quantitative Metrics

R-hat (Potential Scale Reduction Factor)

In simple terms, R-hat evaluates whether bees that started at different places have all explored the same rooms at the end of the day. To estimate the posterior, an MCMC algorithm uses multiple chains (or bees) that start at random locations. R-hat is the metric we use to assess the convergence of the chains. It measures whether multiple MCMC chains have mixed well (i.e., if they have sampled the same topography) by comparing the variance of samples within each chain to the variance of the sample means across chains. Intuitively, it behaves like

\[
\hat{R} \approx \sqrt{\frac{\text{Within-Chain Variance} + \text{Between-Chain Variance}}{\text{Within-Chain Variance}}}
\]

If R-hat is close to 1.0 (or below 1.01), the between-chain variance is negligible compared to the within-chain variance, suggesting that the chains have converged to the same distribution. In other words, the chains are behaving similarly and are indistinguishable from one another. This is precisely what we see after sampling the posterior of the first model, shown in the last column of the table below:

Figure 1. Summary statistics of the sampler highlighting ideal R-hats.

The R-hat values from the second model, however, tell a different story. The fact that we have such large R-hat values indicates that, at the end of the sampling process, the different chains had not converged yet. In practice, this means that the distributions they explored and returned were different, or that each bee created a map of a different room of the house. This fundamentally leaves us without a clue of how the pieces connect or what the complete floor plan looks like.

Figure 2. Summary statistics of the sampler showcasing problematic R-hats.

Given our R-hat readouts were large, we know something went wrong with the sampling process in the second model. However, even if the R-hat had turned out within acceptable levels, this does not give us certainty that the sampling process worked. R-hat is just a diagnostic tool, not a guarantee. Sometimes, even if your R-hat readout is lower than 1.01, the sampler might not have properly explored the full posterior. This happens when multiple bees start their exploration in the same room and remain there. Likewise, if you’re using a small number of chains, and if your posterior happens to be multimodal, there is a probability that all chains started in the same mode and failed to explore other peaks. 

The R-hat readout reflects convergence, not completion. In order to have a more comprehensive idea, we need to check other diagnostic metrics as well.
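If you are working in PyMC, the R-hat column from tables like the ones above can be reproduced with ArviZ. Here, idata_1 and idata_2 refer to the InferenceData objects from the earlier sketch; if your objects are named differently, substitute accordingly.

import arviz as az

# Full summary tables (mean, sd, HDI, ESS, R-hat), as in the figures above
print(az.summary(idata_1))
print(az.summary(idata_2))

# Or just the R-hat values; anything above ~1.01 deserves a closer look
print(az.rhat(idata_2))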

Effective Sample Size (ESS)

When explaining what MCMC was, we mentioned that “Monte Carlo” refers to the fact that the next state is chosen randomly. This does not necessarily mean that the states are fully independent. Even though the bees choose their next step at random, these steps are still correlated to some extent. If a bee is exploring a living room at time t=0, it will probably still be in the living room at time t=1, even though it is in a different part of the same room. Due to this natural connection between samples, we say these two data points are autocorrelated.

Due to their nature, MCMC methods inherently produce autocorrelated samples, which complicates statistical analysis and requires careful evaluation. In statistical inference, we often assume independent samples to ensure that the estimates of uncertainty are accurate, hence the need for uncorrelated samples. If two data points are too similar to each other, the correlation reduces their effective information content. Mathematically, the formula below represents the autocorrelation function between two time points (t1 and t2) in a random process:

\[
R_{XX}(t_1, t_2) = E[X_{t_1} \overline{X_{t_2}}]
\]

where E is the expected value operator and the bar denotes the complex conjugate (for the real-valued chains we deal with here, it is simply X itself). In MCMC sampling, this is crucial because high autocorrelation means that new samples don’t teach us anything different from the old ones, effectively reducing the sample size we have. Unsurprisingly, the metric that reflects this is called the Effective Sample Size (ESS), and it helps us determine how many truly independent samples we have.

As hinted previously, the effective sample size accounts for autocorrelation by estimating how many truly independent samples would provide the same information as the autocorrelated samples we have. Mathematically, for a parameter θ, the ESS is defined as:

\[
ESS = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho(\theta)_k}
\]

where n is the total number of samples and ρ(θ)k is the autocorrelation at lag k for parameter θ.

Typically, for ESS readouts, the higher, the better. This is what we see in the readout for the first model. Two common ESS variations are Bulk-ESS, which assesses mixing in the central part of the distribution, and Tail-ESS, which focuses on the efficiency of sampling the distribution’s tails. Both inform us if our model accurately reflects the central tendency and credible intervals.

Figure 3. Summary statistics of the sampler highlighting ideal quantities for ESS bulk and tail.

In contrast, the readouts for the second model are very bad. Typically, we want to see readouts that are at least 1/10 of the total sample size. In this case, given each chain sampled 2000 observations, we should expect ESS readouts of at least 800 (from the total size of 8000 samples across 4 chains of 2000 samples each), which is not what we observe.

Figure 4. Summary statistics of the sampler demonstrating problematic ESS bulk and tail.
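Bulk- and Tail-ESS can be pulled out with a single ArviZ call per variant; again, idata_2 stands for the mixture model's InferenceData from the earlier sketch.

import arviz as az

# Rule of thumb from above: with 4 chains of 2,000 draws, look for ESS of at least 800
print(az.ess(idata_2, method="bulk"))
print(az.ess(idata_2, method="tail"))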

Visual Diagnostics

Apart from the numerical metrics, our understanding of sampler performance can be deepened through the use of diagnostic plots. The main ones are rank plots, trace plots, and pair plots.

Rank Plots

A rank plot helps us identify whether the different chains have explored all of the posterior distribution. If we once again think of the bee analogy, rank plots tell us which bees explored which parts of the house. Therefore, to evaluate whether the posterior was explored equally by all chains, we observe the shape of the rank plots produced by the sampler. Ideally, we want the distribution of all chains to look roughly uniform, like in the rank plots generated after sampling the first model. Each color below represents a chain (or bee):

Figure 5. Rank plots for parameters ‘m’ and ‘s’ across four MCMC chains. Each bar represents the distribution of rank values for one chain, with ideally uniform ranks indicating good mixing and proper convergence.

Under the hood, a rank plot is produced with a simple sequence of steps. First, we run the sampler and let it sample from the posterior of each parameter. In our case, we are sampling posteriors for parameters m and s of the first model. Then, parameter by parameter, we get all samples from all chains, put them together, and order them from smallest to largest. We then ask ourselves, for each sample, what was the chain where it came from? This will allow us to create plots like the ones we see above. 
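In practice you rarely build rank plots by hand; a single ArviZ call (assuming the same idata_1 object as before) produces figures like the one above.

import arviz as az
import matplotlib.pyplot as plt

# One panel per parameter, one set of colored bars per chain
az.plot_rank(idata_1, var_names=["m", "s"])
plt.show()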

In contrast, bad rank plots are easy to spot. Unlike the previous example, the distributions from the second model, shown below, are not uniform. From the plots, what we interpret is that each chain, after beginning at different random locations, got stuck in a region and did not explore the entirety of the posterior. Consequently, we cannot make inferences from the results, as they are unreliable and not representative of the true posterior distribution. This would be equivalent to having four bees that started at different rooms of the house and got stuck somewhere during their exploration, never covering the entirety of the property.

Figure 6. Rank plots for parameters m, s_squared, and w across four MCMC chains. Each subplot shows the distribution of ranks by chain. There are noticeable deviations from uniformity (e.g., stair-step patterns or imbalances across chains) suggesting potential sampling issues.

KDE and Trace Plots

Similar to R-hat, trace plots help us assess the convergence of MCMC samples by visualizing how the algorithm explores the parameter space over time. PyMC provides two types of trace plots to diagnose mixing issues: Kernel Density Estimate (KDE) plots and iteration-based trace plots. Each of these serves a distinct purpose in evaluating whether the sampler has properly explored the target distribution.

The KDE plot (usually on the left) estimates the posterior density for each chain, where each line represents a separate chain. This allows us to check whether all chains have converged to the same distribution. If the KDEs overlap, it suggests that the chains are sampling from the same posterior and that mixing has occurred. On the other hand, the trace plot (usually on the right) visualizes how parameter values change over MCMC iterations (steps), with each line representing a different chain. A well-mixed sampler will produce trace plots that look noisy and random, with no clear structure or separation between chains.

Using the bee analogy, trace plots can be thought of as snapshots of the “features” of the house at different locations. If the sampler is working correctly, the KDEs in the left plot should align closely, showing that all bees (chains) have explored the house similarly. Meanwhile, the right plot should show highly variable traces that blend together, confirming that the chains are actively moving through the space rather than getting stuck in specific regions.
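Plots like the one below come from a single ArviZ call; assuming the same idata_1 object as before, a minimal version looks like this.

import arviz as az
import matplotlib.pyplot as plt

# Left column: per-chain KDEs; right column: sampled values per iteration
az.plot_trace(idata_1, var_names=["m", "s"])
plt.show()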

Figure 7. Density and trace plots for parameters m and s from the first model across four MCMC chains. The left panel shows kernel density estimates (KDE) of the marginal posterior distribution for each chain, indicating consistent central tendency and spread. The right panel displays the trace plot over iterations, with overlapping chains and no apparent divergences, suggesting good mixing and convergence.

However, if your sampler has poor mixing or convergence issues, you will see something like the figure below. In this case, the KDEs will not overlap, meaning that different chains have sampled from different distributions rather than a shared posterior. The trace plot will also show structured patterns instead of random noise, indicating that chains are stuck in different regions of the parameter space and failing to fully explore it.

Figure 8. KDE (left) and trace plots (right) for parameters m, s_squared, and w across MCMC chains for the second model. Multimodal distributions are visible for m and w, suggesting potential identifiability issues. Trace plots reveal that chains explore different modes with limited mixing, particularly for m, highlighting challenges in convergence and effective sampling.

By using trace plots alongside the other diagnostics, you can identify sampling issues and determine whether your MCMC algorithm is effectively exploring the posterior distribution.

Pair Plots

A third kind of plot that is often useful for diagnostic are pair plots. In models where we want to estimate the posterior distribution of multiple parameters, pair plots allow us to observe how different parameters are correlated. To understand how such plots are formed, think again about the bee analogy. If you imagine that we’ll create a plot with the width and length of the house, each “step” that the bees take can be represented by an (x, y) combination. Likewise, each parameter of the posterior is represented as a dimension, and we create scatter plots showing where the sampler walked using parameter values as coordinates. Here, we are plotting each unique pair (x, y), resulting in the scatter plot you see in the middle of the image below. The one-dimensional plots you see on the edges are the marginal distributions over each parameter, giving us additional information on the sampler’s behavior when exploring them.
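Here is a hedged sketch of how such a figure can be produced with ArviZ, again assuming the idata objects from the earlier sketch; marginals=True adds the one-dimensional distributions on the edges.

import arviz as az
import matplotlib.pyplot as plt

az.plot_pair(idata_1, var_names=["m", "s"], kind="scatter", marginals=True)
plt.show()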

Take a look at the pair plot from the first model.

Figure 9. Joint posterior distribution of parameters m and s, with marginal densities. The scatter plot shows a roughly symmetric, elliptical shape, suggesting a low correlation between m and s.

Each axis represents one of the two parameters whose posteriors we are estimating. For now, let’s focus on the scatter plot in the middle, which shows the parameter combinations sampled from the posterior. The fact we have a very even distribution means that, for any particular value of m, there was a range of values of s that were equally likely to be sampled. Additionally, we don’t see any correlation between the two parameters, which is usually good! There are cases when we would expect some correlation, such as when our model involves a regression line. However, in this instance, we have no reason to believe two parameters should be highly correlated, so the fact we don’t observe unusual behavior is positive news. 

Now, take a look at the pair plots from the second model.

Figure 10. Pair plot of the joint posterior distributions for parameters m, s_squared, and w. The scatter plots reveal strong correlations between several parameters.

Given that this model has five parameters to be estimated, we naturally have a greater number of plots since we are analyzing them pair-wise. However, they look odd compared to the previous example. Namely, rather than having an even distribution of points, the samples here either seem to be divided across two regions or seem somewhat correlated. This is another way of visualizing what the rank plots have shown: the sampler did not explore the full posterior distribution. Below we isolated the top left plot, which contains the samples from m0 and m1. Unlike the plot from model 1, here we see that the value of one parameter greatly influences the value of the other. If we sampled m1 around 2.5, for example, m0 is likely to be sampled from a very narrow range around 1.5.

Figure 11. Joint posterior distribution of parameters m₀ and m₁, with marginal densities.

Certain shapes can be observed in problematic pair plots relatively frequently. Diagonal patterns, for example, indicate a high correlation between parameters. Banana shapes are often connected to parametrization issues, often being present in models with tight priors or constrained parameters. Funnel shapes might indicate hierarchical models with bad geometry. When we have two separate islands, like in the plot above, this can indicate that the posterior is bimodal AND that the chains haven’t mixed well. However, keep in mind that these shapes might indicate problems, but not necessarily do so. It’s up to the data scientist to examine the model and determine which behaviors are expected and which ones are not!

Some Fixing Techniques

When your diagnostics indicate sampling problems — whether concerning R-hat values, low ESS, unusual rank plots, separated trace plots, or strange parameter correlations in pair plots — several strategies can help you address the underlying issues. Sampling problems typically stem from the target posterior being too complex for the sampler to explore efficiently. Complex target distributions might have:

  • Multiple modes (peaks) that the sampler struggles to move between
  • Irregular shapes with narrow “corridors” connecting different regions
  • Areas of drastically different scales (like the “neck” of a funnel)
  • Heavy tails that are difficult to sample accurately

In the bee analogy, these complexities represent houses with unusual floor plans — disconnected rooms, extremely narrow hallways, or areas that change dramatically in size. Just as bees might get trapped in specific regions of such houses, MCMC chains can get stuck in certain areas of the posterior.

Figure 12. Examples of multimodal target distributions.
Figure 13. Examples of weirdly shaped distributions.

To help the sampler in its exploration, there are simple strategies we can use.

Strategy 1: Reparameterization

Reparameterization is particularly effective for hierarchical models and distributions with challenging geometries. It involves transforming your model’s parameters to make them easier to sample. Back to the bee analogy, imagine the bees are exploring a house with a peculiar layout: a spacious living room that connects to the kitchen through a very, very narrow hallway. One aspect we hadn’t mentioned before is that the bees have to fly in the same way through the entire house. That means that if we dictate the bees should use large “steps,” they will explore the living room very well but hit the walls in the hallway head-on. Likewise, if their steps are small, they will explore the narrow hallway well, but take forever to cover the entire living room. The difference in scales, which is natural to the house, makes the bees’ job more difficult.

A classic example that represents this scenario is Neal’s funnel, where the scale of one parameter depends on another:

\[
p(y, x) = \text{Normal}(y|0, 3) \times \prod_{n=1}^{9} \text{Normal}(x_n|0, e^{y/2})
\]

Figure 14. Log marginal density of y and the first dimension of Neal’s funnel. The neck is where the sampler struggles to sample, and the required step size there is much smaller than in the body. (Image source: Stan User’s Guide)

We can see that the scale of x is dependent on the value of y. To fix this problem, we can separate x and y as independent standard Normals and then transform these variables into the desired funnel distribution. Instead of sampling directly like this:

\[
\begin{align*}
y &\sim \text{Normal}(0, 3) \\
x &\sim \text{Normal}(0, e^{y/2})
\end{align*}
\]

You can reparameterize to sample from standard Normals first:

\[
\begin{align*}
y_{raw} &\sim \text{Normal}(0, 1) \\
x_{raw} &\sim \text{Normal}(0, 1) \\
y &= 3y_{raw} \\
x &= e^{y/2} x_{raw}
\end{align*}
\]

This technique separates the hierarchical parameters and makes sampling more efficient by eliminating the dependency between them. 

Reparameterization is like redesigning the house such that instead of forcing the bees to find a single narrow hallway, we create a new layout where all passages have similar widths. This helps the bees use a consistent flying pattern throughout their exploration.
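Below is a hedged PyMC sketch of this non-centered version of Neal’s funnel; it mirrors the reparameterized equations above rather than any code from the original analysis, and the sampler settings are illustrative.

import pymc as pm

with pm.Model() as funnel_noncentered:
    # Sample the "raw" variables from standard Normals...
    y_raw = pm.Normal("y_raw", mu=0, sigma=1)
    x_raw = pm.Normal("x_raw", mu=0, sigma=1, shape=9)

    # ...then deterministically transform them into the funnel geometry
    y = pm.Deterministic("y", 3 * y_raw)
    x = pm.Deterministic("x", pm.math.exp(y / 2) * x_raw)

    idata_funnel = pm.sample(2000, chains=4, random_seed=1)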

Strategy 2: Handling Heavy-tailed Distributions

Heavy-tailed distributions like Cauchy and Student-T present challenges for samplers and the ideal step size. Their tails require larger step sizes than their central regions (similar to very long hallways that require the bees to travel long distances), which creates a challenge:

  • Small step sizes lead to inefficient sampling in the tails
  • Large step sizes cause too many rejections in the center

Figure 15. Probability density functions for various Cauchy distributions illustrate the effects of changing the location parameter and scale parameter. (Image source: Wikipedia)

Reparameterization solutions include (see the sketch after this list):

  • For Cauchy: Defining the variable as a transformation of a Uniform distribution using the Cauchy inverse CDF
  • For Student-T: Using a Gamma-Mixture representation
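The two bullet points above can be sketched in PyMC along these lines; the degrees of freedom, scale, and variable names are illustrative assumptions rather than recommendations.

import numpy as np
import pymc as pm
import pytensor.tensor as pt

nu = 3        # Student-T degrees of freedom (illustrative)
scale = 1.0   # Cauchy scale (illustrative)

with pm.Model() as heavy_tails:
    # Cauchy via the inverse CDF of a Uniform(0, 1) variable
    u = pm.Uniform("u", lower=0, upper=1)
    cauchy_x = pm.Deterministic("cauchy_x", scale * pt.tan(np.pi * (u - 0.5)))

    # Student-T as a Gamma-mixture of Normals: draw a precision from a Gamma,
    # then a Normal with that precision; marginally this is a Student-T(nu)
    precision = pm.Gamma("precision", alpha=nu / 2, beta=nu / 2)
    student_x = pm.Normal("student_x", mu=0, tau=precision)

    idata_tails = pm.sample(2000, chains=4, random_seed=1)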

Strategy 3: Hyperparameter Tuning

Sometimes the solution lies in adjusting the sampler’s hyperparameters, as sketched in code after the list below:

  • Increase total iterations: The simplest approach — give the sampler more time to explore.
  • Increase target acceptance rate (adapt_delta in Stan, target_accept in PyMC): Reduce divergent transitions (try 0.9 instead of the default 0.8 for complex models, for example).
  • Increase max_treedepth: Allow the sampler to take more steps per iteration.
  • Extend warmup/adaptation phase: Give the sampler more time to adapt to the posterior geometry.
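In PyMC, these knobs map onto arguments of pm.NUTS and pm.sample; the numbers below are purely illustrative, and model_2 refers to the mixture model sketched earlier.

import pymc as pm

with model_2:
    # target_accept is PyMC's analogue of Stan's adapt_delta
    step = pm.NUTS(target_accept=0.9, max_treedepth=12)
    idata_2 = pm.sample(
        draws=4000,   # more iterations
        tune=2000,    # longer warmup/adaptation phase
        chains=4,
        step=step,
        random_seed=1,
    )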

Remember that while these adjustments may improve your diagnostic metrics, they often treat symptoms rather than underlying causes. Strategies such as reparameterization (above) and better proposal distributions (next) typically offer more fundamental solutions.

Strategy 4: Better Proposal Distributions

This strategy applies to samplers that rely on an explicit proposal distribution, such as Metropolis-Hastings, rather than gradient-based samplers like NUTS. It basically asks the question: “I’m currently here in this landscape. Where should I jump to next so that I explore the full landscape, and how do I know that the next jump is the jump I should make?” Thus, choosing a good proposal distribution means making sure that the sampling process explores the full parameter space instead of just a specific region. A good proposal distribution should:

  1. Have substantial probability mass where the target distribution does.
  2. Allow the sampler to make jumps of the appropriate size.

One common choice of the proposal distribution is the Gaussian (Normal) distribution with mean μ and standard deviation σ — the scale of the distribution that we can tune to decide how far to jump from the current position to the next position. If we choose the scale for the proposal distribution to be too small, it might either take too long to explore the entire posterior or it will get stuck in a region and never explore the full distribution. But if the scale is too large, you might never get to explore some regions, jumping over them. It’s like playing ping-pong where we only reach the two edges but not the middle.
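To make the trade-off tangible, here is a small extension of the random-walk Metropolis sketch from earlier: it compares acceptance rates for three proposal scales against a standard Normal target. The scales are arbitrary choices for illustration.

import numpy as np

def acceptance_rate(step_size, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    current, accepted = 0.0, 0
    for _ in range(n_samples):
        proposal = current + rng.normal(scale=step_size)
        # Standard Normal target: accept with probability min(1, p(proposal) / p(current))
        if np.log(rng.uniform()) < 0.5 * (current**2 - proposal**2):
            current, accepted = proposal, accepted + 1
    return accepted / n_samples

for scale in [0.01, 1.0, 100.0]:
    print(f"proposal scale {scale:>6}: acceptance rate ~ {acceptance_rate(scale):.2f}")

A tiny scale accepts almost everything but barely moves; a huge scale is rejected most of the time; something in between explores the distribution efficiently.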

Improve Prior Specification

When all else fails, reconsider your model’s prior specifications. Vague or weakly informative priors (like uniformly distributed priors) can sometimes lead to sampling difficulties. More informative priors, when justified by domain knowledge, can help guide the sampler toward more reasonable regions of the parameter space. Sometimes, despite your best efforts, a model may remain challenging to sample effectively. In such cases, consider whether a simpler model might achieve similar inferential goals while being more computationally tractable. The best model is often not the most complex one, but the one that balances complexity with reliability. The table below shows the summary of fixing strategies for different issues.

Diagnostic Signal | Potential Issue | Recommended Fix
High R-hat | Poor mixing between chains | Increase iterations, adjust the step size
Low ESS | High autocorrelation | Reparameterization, increase adapt_delta
Non-uniform rank plots | Chains stuck in different regions | Better proposal distribution, start with multiple chains
Separated KDEs in trace plots | Chains exploring different distributions | Reparameterization
Funnel shapes in pair plots | Hierarchical model issues | Non-centered reparameterization
Disjoint clusters in pair plots | Multimodality with poor mixing | Adjusted distribution, simulated annealing

Conclusion

Assessing the quality of MCMC sampling is crucial for ensuring reliable inference. In this article, we explored key diagnostic metrics such as R-hat, ESS, rank plots, trace plots, and pair plots, discussing how each helps determine whether the sampler is performing properly.

If there’s one takeaway we want you to keep in mind, it’s that you should always run diagnostics before drawing conclusions from your samples. No single metric provides a definitive answer — each serves as a tool that highlights potential issues rather than proving convergence. When problems arise, strategies such as reparameterization, hyperparameter tuning, and prior specification can help improve sampling efficiency.

By combining these diagnostics with thoughtful modeling decisions, you can ensure a more robust analysis, reducing the risk of misleading inferences due to poor sampling behavior.

References

B. Gilbert, Bob’s bees: the importance of using multiple bees (chains) to judge MCMC convergence (2018), Youtube

Chi-Feng, MCMC demo (n.d.), GitHub

D. Simpson, Maybe it’s time to let the old ways die; or We broke R-hat so now we have to fix it. (2019), Statistical Modeling, Causal Inference, and Social Science

M. Taboga, Markov Chain Monte Carlo (MCMC) methods (2021), Lectures on probability theory and mathematical Statistics. Kindle Direct Publishing. 

T. Wiecki, MCMC Sampling for Dummies (2024), twiecki.io

Stan User’s Guide, Reparameterization (n.d.), Stan Documentation

The Invisible Revolution: How Vectors Are (Re)defining Business Success

The hidden force behind AI is powering the next wave of business transformation

In a world that is increasingly driven by data, business leaders must understand vector thinking. At first, vectors may appear as complicated as algebra was in school, but they serve as a fundamental building block. Just as algebra is essential for tasks like splitting a bill or computing interest, vectors are essential to the digital systems that underpin decision making, customer engagement, and data protection.

They represent a radically different concept of relationships and patterns. They do not simply divide data into rigid categories. Instead, they offer a dynamic, multidimensional view of the underlying connections. “Similar,” for two customers, may mean more than shared demographics or purchase histories; it’s their behaviors, preferences, and habits that distinctly align. Such associations can be defined and measured accurately in a vector space. But for many modern businesses, that logic feels too complex, so leaders tend to fall back on old, learned, rule-based patterns instead. Fraud detection, for example, once relied on simple rules about transaction limits; we have since evolved to recognize patterns and anomalies.

While it might have been common to block transactions that allocate 50% of your credit card limit at once just a few years ago, we are now able to analyze your retailer-specific spend history, look at average baskets of other customers at the very same retailers, and do some slight logic checks such as the physical location of your previous spends.

So a $7,000 transaction at a McDonald’s in Dubai might just not go through if you only spent $3 on a bike rental in Amsterdam moments earlier. Even $20 wouldn’t work, since logical vector patterns can rule out the physical distance as plausible. Instead, the $7,000 transaction for your new E-Bike at a retailer near Amsterdam’s city center may just work flawlessly. Welcome to the insight of living in a world managed by vectors.

The danger of ignoring the paradigm of vectors is huge. Not mastering algebra can lead to bad financial decisions; similarly, not knowing vectors can leave you vulnerable as a business leader. While the average customer may stay as unaware of vectors as an average passenger on a plane is of aerodynamics, a business leader should at least be aware of what kerosene is and how many seats need to be filled to break even on a specific flight. You may not need to fully understand the systems you rely on, but a basic understanding helps you know when to reach out to the experts. And this is exactly my aim in this little journey into the world of vectors: to make you aware of the basic principles and help you know when to ask for more, so you can better steer and manage your business.

In the hushed hallways of research labs and tech companies, a revolution was brewing that would change how computers understood the world. This revolution had nothing to do with processing power or storage capacity. It was all about teaching machines to understand context, meaning, and nuance in words, using mathematical representations called vectors. Before we can appreciate the magnitude of this shift, we first need to understand what came before it.

Think about the way humans take in information. When we look at a cat, we don’t just process a checklist of components: whiskers, fur, four legs. Instead, our brains work through a network of relationships, contexts, and associations. We know a cat is more like a lion than a bicycle, and not because we memorized that fact; our brains have naturally learned these relationships. Vector representations let computers consume content in a similarly relational way, and we ought to understand how and why this is true. It’s as fundamental as knowing algebra in the time of an impending AI revolution.

In this brief jaunt into the vector realm, I will explain how vector-based computing works and why it’s so transformative. The code examples are for illustration only and are not meant as stand-alone functionality. You don’t have to be an engineer to understand these concepts. All you have to do is follow along, as I walk you through the examples with plain-language commentary explaining each one, step by step. I don’t aim to be a world-class mathematician. I want to make vectors understandable to everyone: business leaders, managers, engineers, musicians, and others.


What are vectors, anyway?

Photo by Pete F on Unsplash

The vector-based computing journey did not start recently. Its roots go back to the 1950s and the development of distributed representations in cognitive science. James McClelland and David Rumelhart, among other researchers, theorized that the brain holds concepts not as individual entities but as distributed patterns of activity across networks of neurons. This idea paved the way for contemporary vector representations.

The real breakthrough was three things coming together:
The exponential growth in computational power, the development of sophisticated neural network architectures, and the availability of massive datasets for training.

It is the combination of these elements that makes vector-based systems theoretically possible and practically implementable at scale. Mainstream AI as people have come to know it (with the likes of ChatGPT) is a direct consequence of this.

To better understand, let me put this in context: conventional computing systems work on discrete, human-readable symbols and rules. A traditional system, for instance, might represent a customer as a record:

customer = {
    'id': '12345',
    'age': 34,
    'purchase_history': ['electronics', 'books'],
    'risk_level': 'low'
}

This representation may be readable and logical, but it misses subtle patterns and relationships. In contrast, vector representations encode information within a high-dimensional space where relationships arise naturally through geometric proximity. That same customer might be represented as a 384-dimensional vector, where each dimension contributes to a rich, nuanced profile. Simple code is enough to transform flat, tabular customer data into such vectors. Let’s take a look at how simple this is:

from sentence_transformers import SentenceTransformer
import numpy as np

class CustomerVectorization:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        
    def create_customer_vector(self, customer_data):
        """
        Transform customer data into a rich vector representation
        that captures subtle patterns and relationships
        """
        # Combine various customer attributes into a meaningful text representation
        customer_text = f"""
        Customer profile: {customer_data['age']} year old,
        interested in {', '.join(customer_data['purchase_history'])},
        risk level: {customer_data['risk_level']}
        """
        
        # Generate base vector from text description
        base_vector = self.model.encode(customer_text)
        
        # Enrich vector with numerical features
        numerical_features = np.array([
            customer_data['age'] / 100,  # Normalized age
            len(customer_data['purchase_history']) / 10,  # Purchase history length
            self._risk_level_to_numeric(customer_data['risk_level'])
        ])
        
        # Combine text-based and numerical features
        combined_vector = np.concatenate([
            base_vector,
            numerical_features
        ])
        
        return combined_vector
    
    def _risk_level_to_numeric(self, risk_level):
        """Convert categorical risk level to normalized numeric value"""
        risk_mapping = {'low': 0.1, 'medium': 0.5, 'high': 0.9}
        return risk_mapping.get(risk_level.lower(), 0.5)

I trust that this code example has helped demonstrate how easily complex customer data can be encoded into meaningful vectors. The method seems complex at first, but it is simple: we merge text and numerical data about customers, which gives us rich, information-dense vectors that capture each customer’s essence. What I love most about this technique is its simplicity and flexibility. Similarly to how we encoded age, purchase history, and risk levels here, you could replicate this pattern to capture any other customer attributes relevant to your use case. Just recall the credit card spending patterns we described earlier. It’s similar data being turned into vectors, giving it a meaning far greater than it could ever have had if it had stayed flat and been fed into traditional rule-based logic.
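To connect this back to the record shown earlier, here is a hedged usage sketch. It assumes the sentence-transformers package is installed and the model weights can be downloaded, and the second customer is invented purely for the comparison.

import numpy as np

vectorizer = CustomerVectorization()

customer_a = {
    'id': '12345',
    'age': 34,
    'purchase_history': ['electronics', 'books'],
    'risk_level': 'low'
}

customer_b = {
    'id': '67890',
    'age': 36,
    'purchase_history': ['electronics', 'games'],
    'risk_level': 'low'
}

vec_a = vectorizer.create_customer_vector(customer_a)
vec_b = vectorizer.create_customer_vector(customer_b)

# Cosine similarity: how closely aligned are the two customer profiles?
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"Customer similarity: {similarity:.3f}")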

What our little code example allowed us to do is combine two complementary representations, one in a semantically rich text-embedding space and one in a normalized numerical space, mapping every record to a point that can be compared directly with others.

This allows the systems to identify complex patterns and relations that traditional data structures won’t be able to reflect adequately. With the geometric nature of vector spaces, the shape of these structures tells the stories of similarities, differences, and relationships, allowing for an inherently standardized yet flexible representation of complex data. 

Going forward, you will see this structure copied across other applications of vector-based customer analysis: select the relevant data, aggregate it into a format we can work with, and build a meta-representation that combines heterogeneous data into a common vector form. Whether it’s recommendation systems, customer segmentation models, or predictive analytics tools, this fundamental approach to thoughtful vectorization will underpin all of it. Thus, it is worth knowing and understanding even if you consider yourself non-technical and more on the business side.

Just keep in mind: the key is considering which parts of your data carry meaningful signals and how to encode them in a way that preserves their relationships. It is nothing but your business logic, expressed in a different way of thinking than algebra. A more modern, multi-dimensional way.


The Mathematics of Meaning (Kings and Queens)

Photo by Debbie Fan on Unsplash

All human communication carries rich networks of meaning that our brains are wired to make sense of automatically. These are meanings we can capture mathematically with vector-based computing: we can represent words as points in a multi-dimensional word space. This geometric treatment allows us to think about abstract semantic relations in spatial terms, as distances and directions.

For instance, the relationship “King is to Queen as Man is to Woman” is encoded in a vector space in such a way that the direction and distance between the words “King” and “Queen” are similar to those between the words “Man” and “Woman.”

Let’s take a step back to understand why this might be: the key component that makes this system work is word embeddings — numerical representations that encode words as vectors in a dense vector space. These embeddings are derived from examining co-occurrences of words across large snippets of text. Just as we learn that “dog” and “puppy” are related concepts by observing that they occur in similar contexts, embedding algorithms learn to embed these words close to each other in a vector space.

Word embeddings reveal their real power when we look at how they encode analogical relationships. Think about what we know about the relationship between “king” and “queen.” We can tell through intuition that these words are different in gender but share associations related to the palace, authority, and leadership. Through a wonderful property of vector space systems — vector arithmetic — this relationship can be captured mathematically.

The classic example captures this beautifully:

vector('king') - vector('man') + vector('woman') ≈ vector('queen')

This equation tells us that if we have the vector for “king,” and we subtract out the “man” vector (we remove the concept of “male”), and then we add the “woman” vector (we add the concept of “female”), we get a new point in space very close to that of “queen.” That’s not some mathematical coincidence — it’s based on how the embedding space has arranged the meaning in a sort of structured way.

We can apply this idea of context in Python with pre-trained word embeddings:

import gensim.downloader as api

# Load a pre-trained model that contains word vectors learned from Google News
model = api.load('word2vec-google-news-300')

# Define our analogy words
source_pair = ('king', 'man')
target_word = 'woman'

# Find which word completes the analogy using vector arithmetic
result = model.most_similar(
    positive=[target_word, source_pair[0]], 
    negative=[source_pair[1]], 
    topn=1
)

# Display the result
print(f"{source_pair[0]} is to {source_pair[1]} as {target_word} is to {result[0][0]}")

The structure of this vector space exposes many basic principles:

  1. Semantic similarity is present as spatial proximity. Related words congregate: the neighborhoods of ideas. “Dog,” “puppy,” and “canine” would be one such cluster; meanwhile, “cat,” “kitten,” and “feline” would create another cluster nearby.
  2. Relationships between words become directions in the space. The vector from “man” to “woman” encodes a gender relationship, and other such relationships (for example, “king” to “queen” or “actor” to “actress”) typically point in the same direction.
  3. The magnitude of vectors can carry meaning about word importance or specificity. Common words often have shorter vectors than specialized terms, reflecting their broader, less specific meanings.

Working with relationships between words in this way gave us a geometric encoding of meaning and the mathematical precision needed to reflect the nuances of natural language processing to machines. Instead of treating words as separate symbols, vector-like systems can recognize patterns, make analogies, and even uncover relationships that were never programmed.

To better grasp what was just discussed, I took the liberty of mapping the words we mentioned before (“King, Man, Woman”; “Dog, Puppy, Canine”; “Cat, Kitten, Feline”) to corresponding 2D vectors. These vectors numerically represent semantic meaning.

Visualization of the before-mentioned example terms as 2D word embeddings. Showing grouped categories for explanatory purposes. Data is fabricated and axes are simplified for educational purposes.
  • Human-related words have high positive values on both dimensions.
  • Dog-related words have negative x-values and positive y-values.
  • Cat-related words have positive x-values and negative y-values.

Be aware that those values are fabricated by me for illustration. As shown in the 2D space where the vectors are plotted, you can observe groups based on the positions of the dots representing the vectors. The three dog-related words, for example, can be clustered into the “Dog” category, and so on.

Grasping these basic principles gives us insight into both the capabilities and limitations of modern language AI, such as large language models (LLMs). Though these systems can perform amazing analogical and relational gymnastics, they are ultimately built on geometric patterns derived from the ways words appear in proximity to one another in a body of text: an elaborate but, by definition, partial reflection of human linguistic comprehension. As such, an LLM, being based on vectors, can only generate output grounded in what it has received as input. That doesn’t mean it generates only what it has been trained on 1:1 (we all know about the fantastic hallucination capabilities of LLMs); it means that LLMs, unless specifically instructed, wouldn’t come up with neologisms or new language to describe things. This basic understanding is still lacking among many business leaders who, unaware of the underlying principles of vectors, expect LLMs to be miracle machines.


A Tale of Distances, Angles, and Dinner Parties

Photo by OurWhisky Foundation on Unsplash

Now, let’s assume you’re throwing a dinner party and it’s all about Hollywood and the big movies, and you want to seat people based on what they like. You could just calculate “distance” between their preferences (genres, perhaps even hobbies?) and find out who should sit together. But deciding how you measure that distance can be the difference between compelling conversations and annoyed participants. Or awkward silences. And yes, that company party flashback is repeating itself. Sorry for that!

The same is true in the world of vectors. The distance metric defines how “similar” two vectors look, and therefore, ultimately, how well your system performs to predict an outcome.

Euclidean Distance: Straightforward, but Limited

Euclidean distance measures the straight-line distance between two points in space, making it easy to understand:

  • Euclidean distance is fine as long as vectors are physical locations.
  • However, in high-dimensional spaces (like vectors representing user behavior or preferences), this metric often falls short. Differences in scale or magnitude can skew results, focusing on scale over actual similarity.

Example: Two vectors might represent your dinner guests’ preferences for how much streaming services are used:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

While their preferences align, Euclidean distance would make them seem vastly different because of the disparity in overall activity.

But in higher-dimensional spaces, such as user behavior or textual meaning, Euclidean distance becomes increasingly less informative. It overweights magnitude, which can obscure comparisons. Consider two moviegoers: one has seen 200 action movies, the other has seen 10, but they both like the same genres. Because of the difference in sheer activity level, the second viewer would appear much less similar to the first under Euclidean distance, even though all either of them ever watched were Bruce Willis movies.

Cosine Similarity: Focused on Direction

The cosine similarity method takes a different approach. It focuses on the angle between vectors, not their magnitudes. It’s like comparing the path of two arrows. If they point the same way, they are aligned, no matter their lengths. This shows that it’s perfect for high-dimensional data, where we care about relationships, not scale.

  • If two vectors point in the same direction, they’re considered similar (cosine similarity approx of 1).
  • When opposing (so pointing in opposite directions), they differ (cosine similarity ≈ -1).
  • If they’re perpendicular (at a right angle of 90° to one another), they are unrelated (cosine similarity close to 0).

This normalizing property ensures that the similarity score correctly measures alignment, regardless of how one vector is scaled in comparison to another.

Example: Returning to our streaming preferences, let’s take a look at how our dinner guests’ preferences look as vectors:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

Let us discuss why cosine similarity is really effective in this case. So, when we compute cosine similarity for vec1 [5, 10, 5] and vec2 [1, 2, 1], we’re essentially trying to see the angle between these vectors.

Cosine similarity normalizes the vectors first, dividing each component by the vector’s length. This operation “cancels out” the differences in magnitude:

  • For vec1: Normalization gives us roughly [0.41, 0.82, 0.41].
  • For vec2: Normalization also gives us roughly [0.41, 0.82, 0.41].

And now we also understand why these vectors are considered identical with regard to cosine similarity: their normalized versions are identical!

This tells us that even though dinner guest A views more total content, the proportions they allocate to each genre perfectly mirror dinner guest B’s preferences. It’s like saying both your guests dedicate 25% of their time to action, 50% to drama, and 25% to comedy, no matter the total hours viewed.
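You can verify the normalization argument in a few lines of NumPy; the vectors are the same illustrative dinner-guest preferences.

import numpy as np

vec1 = np.array([5, 10, 5])   # dinner guest A
vec2 = np.array([1, 2, 1])    # dinner guest B

print(vec1 / np.linalg.norm(vec1))   # approximately [0.41, 0.82, 0.41]
print(vec2 / np.linalg.norm(vec2))   # approximately [0.41, 0.82, 0.41]

cosine = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(cosine)   # approximately 1.0, since the directions are identical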

It’s this normalization that makes cosine similarity particularly effective for high-dimensional data such as text embeddings or user preferences.

When dealing with data of many dimensions (think hundreds or thousands of components of a vector for various features of a movie), it is often the relative significance of each dimension corresponding to the complete profile rather than the absolute values that matter most. Cosine similarity identifies precisely this arrangement of relative importance and is a powerful tool to identify meaningful relationships in complex data.


Hiking up the Euclidean Mountain Trail

Photo by Christian Mikhael on Unsplash

In this part, we will see how different approaches to measuring similarity behave in practice, with a concrete real-world example and a little code. Even if you are a non-techie, the code will be easy to understand. It’s there to illustrate the simplicity of it all. No fear!

How about we quickly discuss a 10-mile-long hiking trail? Two friends, Alex and Blake, write trail reviews of the same hike, but each ascribes it a different character:

The trail gained 2,000 feet in elevation over just 2 miles! Easily doable with some high spikes in between!
Alex

and

Beware, we hiked 100 straight feet up in the forest terrain at the spike! Overall, 10 beautiful miles of forest!
Blake

These descriptions can be represented as vectors:

alex_description = [2000, 2]  # [elevation_gain, trail_distance]
blake_description = [100, 10]  # [elevation_gain, trail_distance]

Let’s combine both similarity measures and see what it tells us:

import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Measures how similar the pattern or shape of two descriptions is,
    ignoring differences in scale. Returns 1.0 for perfectly aligned patterns.
    """
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

def euclidean_distance(vec1, vec2):
    """
    Measures the direct 'as-the-crow-flies' difference between descriptions.
    Smaller numbers mean descriptions are more similar.
    """
    return np.linalg.norm(np.array(vec1) - np.array(vec2))

# Alex focuses on the steep part: 2000ft elevation over 2 miles
alex_description = [2000, 2]  # [elevation_gain, trail_distance]

# Blake describes the whole trail: 100ft average elevation per mile over 10 miles
blake_description = [100, 10]  # [elevation_gain, trail_distance]

# Let's see how different these descriptions appear using each measure
print("Comparing how Alex and Blake described the same trail:")
print("\nEuclidean distance:", euclidean_distance(alex_description, blake_description))
print("(A larger number here suggests very different descriptions)")

print("\nCosine similarity:", cosine_similarity(alex_description, blake_description))
print("(A number close to 1.0 suggests similar patterns)")

# Let's also normalize the vectors to see what cosine similarity is looking at
alex_normalized = alex_description / np.linalg.norm(alex_description)
blake_normalized = blake_description / np.linalg.norm(blake_description)

print("\nAlex's normalized description:", alex_normalized)
print("Blake's normalized description:", blake_normalized)

So now, running this code, something magical happens:

Comparing how Alex and Blake described the same trail:

Euclidean distance: 1900.02
(A larger number here suggests very different descriptions)

Cosine similarity: 0.9951
(A number close to 1.0 suggests similar patterns)

Alex's normalized description: [0.9999995 0.001]
Blake's normalized description: [0.9950372 0.0995037]

This output (values rounded for readability) shows why, depending on what you are measuring, the same trail can appear either very different or very similar.

The huge Euclidean distance (about 1,900) suggests these are very different descriptions. That's understandable: 2,000 is a long way from 100, and 2 is a long way from 10. Euclidean distance takes the raw difference between these numbers without understanding their meaning.

But the high cosine similarity (about 0.995) tells us something more interesting: both descriptions capture a similar pattern.

If we look at the normalized vectors, we can see it too: both Alex and Blake describe a trail in which elevation gain is the dominant feature. In each normalized vector, the first number (elevation gain) is much larger than the second (trail distance). Normalization strips away the volume and keeps only the proportions, and in those proportions both descriptions share the same defining trait of the trail.

Perfectly true to life: Alex and Blake hiked the same trail but focused on different parts of it when writing their reviews. Alex zoomed in on the steepest section, a 2,000-foot climb packed into 2 miles, while Blake described the profile of the entire trail, roughly 100 feet of gain per mile across 10 beautiful miles. Cosine similarity identifies these descriptions as variations of the same basic trail pattern, whereas Euclidean distance regards them as completely different trails.

This example highlights the need to select the appropriate similarity measure. In real use cases, normalizing the data and comparing it with cosine similarity surfaces meaningful relationships that plain distance measures like Euclidean distance would miss.


Real-World Impacts of Metric Choices

Photo by fabio on Unsplash

The metric you pick doesn’t merely change the numbers; it influences the results of complex systems. Here’s how it breaks down in various domains:

  • In Recommendation Engines: When it comes to cosine similarity, we can group users who have the same tastes, even if they are doing different amounts of overall activity. A streaming service could use this to recommend movies that align with a user’s genre preferences, regardless of what is popular among a small subset of very active viewers.
  • In Document Retrieval: When querying a database of documents or research papers, cosine similarity ranks documents according to whether their content is similar in meaning to the user’s query, rather than their text length. This enables systems to retrieve results that are contextually relevant to the query, even though the documents are of a wide range of sizes.
  • In Fraud Detection: Patterns of behavior are often more important than pure numbers. Cosine similarity can be used to detect anomalies in spending habits, as it compares the direction of the transaction vectors — type of merchant, time of day, transaction amount, etc. — rather than the absolute magnitude.

And these differences matter because they give a sense of how systems "think". Let's get back to that credit card example one more time: a system based on Euclidean distance might flag a high-value $7,000 transaction for your new e-bike as suspicious, even though that transaction is perfectly normal for you, given that you spend an average of $20,000 a month.

A cosine-based system, on the other hand, understands that the transaction is consistent with what the user typically spends their money on, thus avoiding unnecessary false notifications.
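To make this concrete, here is a minimal sketch of the idea in Python. The feature vectors and the 0.9 threshold are invented purely for illustration (a real system would use far richer features and a learned threshold), but they show how a pattern-based check differs from a magnitude-based one.

import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Hypothetical profile: [share of spend on sports gear, share on groceries, amount in $1000s]
typical_spending = np.array([0.6, 0.3, 2.0])   # what this customer usually looks like
new_transaction = np.array([0.9, 0.05, 7.0])   # the $7,000 e-bike purchase

similarity = cosine_similarity(typical_spending, new_transaction)
print(f"Similarity to the usual spending pattern: {similarity:.2f}")

# Purely illustrative rule: flag the transaction only when the pattern deviates,
# not merely because the amount is large
if similarity < 0.9:
    print("Flag for review")
else:
    print("Consistent with this customer's habits")

With raw Euclidean distance, the jump from a typical basket to a $7,000 purchase would dominate the comparison; the cosine check looks at the shape of the spending instead.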

But measures like Euclidean distance and cosine similarity are not merely theoretical. They’re the blueprints on which real-world systems stand. Whether it’s recommendation engines or fraud detection, the metrics we choose will directly impact how systems make sense of relationships in data.

Vector Representations in Practice: Industry Transformations

Photo by Louis Reed on Unsplash

This ability for abstraction is what makes vector representations so powerful — they transform complex and abstract field data into concepts that can be scored and actioned. These insights are catalyzing fundamental transformations in business processes, decision-making, and customer value delivery across sectors.

Next, let's look at a concrete use case to see how vectors free up time for solving big problems and create new, high-impact opportunities. I picked one industry to show what a vector-based approach to a challenge can achieve: a healthcare example from a clinical setting. Why? Because it matters to all of us and is easier to relate to than digging into the depths of finance, insurance, renewable energy, or chemistry.

Healthcare Spotlight: Pattern Recognition in Complex Medical Data

The healthcare industry poses a perfect storm of challenges that vector representations can uniquely solve. Think of the complexities of patient data: medical histories, genetic information, lifestyle factors, and treatment outcomes all interact in nuanced ways that traditional rule-based systems are incapable of capturing.

At Massachusetts General Hospital, researchers implemented a vector-based early detection system for sepsis, a condition in which every hour of early detection increases the chances of survival by 7.6% (see the full study at pmc.ncbi.nlm.nih.gov/articles/PMC6166236/).

In this new methodology, spontaneous neutrophil velocity profiles (SVP) are used to describe the movement patterns of neutrophils from a drop of blood. We won’t get too medically detailed here, because we’re vector-focused today, but a neutrophil is an immune cell that is kind of a first responder in what the body uses to fight off infections.

The system then encodes each neutrophil's motion as a vector that captures not just its magnitude (i.e., speed) but also its direction. By converting these biological patterns into high-dimensional vector spaces, the researchers could capture subtle differences and show that healthy individuals and sepsis patients exhibit statistically significant differences in movement. These numeric vectors were then processed by a machine learning model trained to detect early signs of sepsis. The result was a diagnostic tool with impressive sensitivity (97%) and specificity (98%) for rapid and accurate identification of this fatal condition, quite possibly using the cosine similarity we learned about a moment ago (the paper doesn't go into much detail, so this is pure speculation, but it would be the most suitable measure).

This is just one example of how medical data can be encoded into vector representations and turned into malleable, actionable insights. This approach made it possible to re-contextualize complex relationships and, combined with machine learning, work around the limitations of previous diagnostic modalities, proving to be a potent tool clinicians can use to save lives. It's a powerful reminder that vectors aren't merely theoretical constructs: they're practical, life-saving solutions that are powering the future of healthcare as much as your credit card risk detection software, and hopefully also your business.


Lead and understand, or face disruption. The naked truth.

Photo by Hunters Race on Unsplash

With all you have read by now, consider a decision as small as the choice of metric under which data relationships are evaluated. Get it wrong, and leaders risk making assumptions that are subtle yet disastrous. Making leadership decisions without understanding the fundamentals of vectors is like punching numbers into a calculator without knowing which formula you are computing: you will get some result, but you cannot know whether it is the right one.

The good news is this doesn’t mean that business leaders have to become data scientists. Vectors are delightful because, once the core ideas have been grasped, they become very easy to work with. An understanding of a handful of concepts (for example, how vectors encode relationships, why distance metrics are important, and how embedding models function) can fundamentally change how you make high-level decisions. These tools will help you ask better questions, work with technical teams more effectively, and make sound decisions about the systems that will govern your business.

The returns on this small investment in comprehension are huge. There is much talk about personalization, yet few organizations apply vector-based thinking to their business strategies, even though it could help them use personalization to its full potential, delighting customers with tailored experiences and building loyalty. You could innovate in areas like fraud detection and operational efficiency by leveraging subtle patterns in data that traditional methods miss, or perhaps even save lives, as described above. Equally important, you can avoid the expensive missteps that happen when leaders defer key decisions to others without understanding what those decisions mean.

The truth is, vectors are here now, quietly driving most of the hyped AI technology behind the scenes and shaping the world we navigate today and tomorrow. Companies whose leadership does not learn to think in vectors risk falling behind in a competitive landscape that becomes ever more data-driven. Those who adopt this new paradigm will not just survive but prosper in an age of never-ending AI innovation.

Now is the moment to act. Start viewing the world through vectors: study their language, examine their logic, and ask how this new way of thinking could change your tactics and your lodestars. Much as algebra became an essential tool for working through practical life challenges, vectors will soon serve as the literacy of the data age; in fact, they already do. This is a future the prepared know how to take control of. The question is not whether vectors will define the next era of business; it is whether you are prepared to lead it.

The post The Invisible Revolution: How Vectors Are (Re)defining Business Success appeared first on Towards Data Science.

]]>
How to Measure Real Model Accuracy When Labels Are Noisy https://towardsdatascience.com/how-to-measure-real-model-accuracy-when-labels-are-noisy/ Thu, 10 Apr 2025 19:22:26 +0000 https://towardsdatascience.com/?p=605709 The math behind “true” accuracy and error correlation

The post How to Measure Real Model Accuracy When Labels Are Noisy appeared first on Towards Data Science.

]]>
Ground truth is never perfect. From scientific measurements to human annotations used to train deep learning models, ground truth always has some amount of errors. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in its human annotations. How, then, can we evaluate predictive models using such erroneous labels?

In this article, we explore how to account for errors in test data labels and estimate a model’s “true” accuracy.

Example: image classification

Let’s say there are 100 images, each containing either a cat or a dog. The images are labeled by human annotators who are known to have 96% accuracy (Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ). If we train an image classifier on some of this data and find that it has 90% accuracy on a hold-out set (Aᵐᵒᵈᵉˡ), what is the “true” accuracy of the model (Aᵗʳᵘᵉ)? A couple of observations first:

  1. Within the 90% of predictions that the model got “right,” some examples may have been incorrectly labeled, meaning both the model and the ground truth are wrong. This artificially inflates the measured accuracy.
  2. Conversely, within the 10% of “incorrect” predictions, some may actually be cases where the model is right and the ground truth label is wrong. This artificially deflates the measured accuracy.

Given these complications, how much can the true accuracy vary?

Range of true accuracy

True accuracy of the model for perfectly correlated and perfectly anti-correlated errors of model and label. Figure by author.

The true accuracy of our model depends on how its errors correlate with the errors in the ground truth labels. If our model’s errors perfectly overlap with the ground truth errors (i.e., the model is wrong in exactly the same way as human labelers), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 - (1 - 0.96) = 86%

Alternatively, if our model is wrong in exactly the opposite way as human labelers (perfect negative correlation), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 + (1 - 0.96) = 94%

Or more generally:

Aᵗʳᵘᵉ = Aᵐᵒᵈᵉˡ ± (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

It’s important to note that the model’s true accuracy can be both lower and higher than its reported accuracy, depending on the correlation between model errors and ground truth errors.

Probabilistic estimate of true accuracy

In some cases, inaccuracies among labels are randomly spread among the examples and not systematically biased toward certain labels or regions of the feature space. If the model’s inaccuracies are independent of the inaccuracies in the labels, we can derive a more precise estimate of its true accuracy.

When we measure Aᵐᵒᵈᵉˡ (90%), we’re counting cases where the model’s prediction matches the ground truth label. This can happen in two scenarios:

  1. Both model and ground truth are correct. This happens with probability Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ.
  2. Both model and ground truth are wrong (in the same way). This happens with probability (1 - Aᵗʳᵘᵉ) × (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ).

Under independence, we can express this as:

Aᵐᵒᵈᵉˡ = Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ + (1 - Aᵗʳᵘᵉ) × (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

Rearranging the terms, we get:

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ + Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ - 1) / (2 × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ - 1)

In our example, that equals (0.90 + 0.96 - 1) / (2 × 0.96 - 1) ≈ 93.5%, which is within the range of 86% to 94% that we derived above.
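As a quick sanity check, both the bounds and the independence-based estimate are easy to compute directly. The small helper functions below are simply a transcription of the formulas above, written for this article rather than taken from any library:

def true_accuracy_range(a_model, a_ground_truth):
    # Worst case: model errors perfectly overlap with label errors
    lower = a_model - (1 - a_ground_truth)
    # Best case: model errors are perfectly anti-correlated with label errors
    upper = a_model + (1 - a_ground_truth)
    return lower, upper

def true_accuracy_independent(a_model, a_ground_truth):
    # Point estimate assuming model errors are independent of label errors
    return (a_model + a_ground_truth - 1) / (2 * a_ground_truth - 1)

print(true_accuracy_range(0.90, 0.96))        # approximately (0.86, 0.94)
print(true_accuracy_independent(0.90, 0.96))  # approximately 0.935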

The independence paradox

Plugging in Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ as 0.96 from our example, we get

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ - 0.04) / 0.92. Let's plot this below.

True accuracy as a function of model’s reported accuracy when ground truth accuracy = 96%. Figure by author.

Strange, isn’t it? If we assume that the model’s errors are uncorrelated with the ground truth errors, then its true accuracy Aᵗʳᵘᵉ always lies above the 1:1 line whenever the reported accuracy is > 0.5. This holds even as we vary Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ:

Model’s “true” accuracy as a function of its reported accuracy and ground truth accuracy. Figure by author.

Error correlation: why models often struggle where humans do

The independence assumption is crucial but often doesn’t hold in practice. If some images of cats are very blurry, or some small dogs look like cats, then both the ground truth and model errors are likely to be correlated. This causes Aᵗʳᵘᵉ to be closer to the lower bound (Aᵐᵒᵈᵉˡ - (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)) than the upper bound.

More generally, model errors tend to be correlated with ground truth errors when:

  1. Both humans and models struggle with the same “difficult” examples (e.g., ambiguous images, edge cases)
  2. The model has learned the same biases present in the human labeling process
  3. Certain classes or examples are inherently ambiguous or challenging for any classifier, human or machine
  4. The labels themselves are generated from another model
  5. There are too many classes (and thus too many different ways of being wrong)

Best practices

The true accuracy of a model can differ significantly from its measured accuracy. Understanding this difference is crucial for proper model evaluation, especially in domains where obtaining perfect ground truth is impossible or prohibitively expensive.

When evaluating model performance with imperfect ground truth:

  1. Conduct targeted error analysis: Examine examples where the model disagrees with ground truth to identify potential ground truth errors.
  2. Consider the correlation between errors: If you suspect correlation between model and ground truth errors, the true accuracy is likely closer to the lower bound (Aᵐᵒᵈᵉˡ - (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)).
  3. Obtain multiple independent annotations: Having multiple annotators can help estimate ground truth accuracy more reliably.

Conclusion

In summary, we learned that:

  1. The range of possible true accuracy depends on the error rate in the ground truth
  2. When errors are independent, the true accuracy is often higher than measured for models better than random chance
  3. In real-world scenarios, errors are rarely independent, and the true accuracy is likely closer to the lower bound

The post How to Measure Real Model Accuracy When Labels Are Noisy appeared first on Towards Data Science.

]]>
Ivory Tower Notes: The Problem https://towardsdatascience.com/ivory-tower-notes-the-problem/ Thu, 10 Apr 2025 18:48:08 +0000 https://towardsdatascience.com/?p=605707 When a data science problem is "the" problem

The post Ivory Tower Notes: The Problem appeared first on Towards Data Science.

]]>
Did you ever spend months on a Machine Learning project, only to discover you never defined the “correct” problem at the start? If so, or even if not, and you are only starting with the data science or AI field, welcome to my first Ivory Tower Note, where I will address this topic. 


The term “Ivory Tower” is a metaphor for a situation in which someone is isolated from the practical realities of everyday life. In academia, the term often refers to researchers who engage deeply in theoretical pursuits and remain distant from the realities that practitioners face outside academia.

As a former researcher, I wrote a short series of posts from my old Ivory Tower notes — the notes before the LLM era.

Scary, I know. I am writing this to manage expectations and the question, “Why ever did you do things this way?” — “Because no LLM told me how to do otherwise 10+ years ago.”

That’s why my notes contain “legacy” topics such as data mining, machine learning, multi-criteria decision-making, and (sometimes) human interactions, airplanes ✈ and art.

Nonetheless, whenever there is an opportunity, I will map my “old” knowledge to generative AI advances and explain how I applied it to datasets beyond the Ivory Tower.

Welcome to post #1…


How every Machine Learning and AI journey starts

 — It starts with a problem. 

For you, this is usually “the” problem because you need to live with it for months or, in the case of research, years.

With “the” problem, I am addressing the business problem you don’t fully understand or know how to solve at first. 

An even worse scenario is when you think you fully understand and know how to solve it quickly. This then creates only more problems that are again only yours to solve. But more about this in the upcoming sections. 

So, what’s “the” problem about?

Causa: It’s mostly about not managing or leveraging resources properly —  workforce, equipment, money, or time. 

Ratio: It’s usually about generating business value, which can span from improved accuracy, increased productivity, cost savings, revenue gains, faster reaction, decision, planning, delivery or turnaround times. 

Veritas: It’s always about finding a solution that relies and is hidden somewhere in the existing dataset. 

Or more than one dataset that someone labelled as “the one” and that is waiting for you to solve the problem. Because datasets follow, and are created from, technical or business process logs, “there has to be a solution lying somewhere within them.”

Ah, if only it were so easy.

Avoiding a different chain of thought again, the point is you will need to:

1 — Understand the problem fully,
2 — If not given, find the dataset “behind” it, and 
3 — Create a methodology to get to the solution that will generate business value from it. 

On this path, you will be tracked and measured, and time will not be on your side to deliver the solution that will solve “the universe equation.” 

That’s why you will need to approach the problem methodologically, drill down to smaller problems first, and focus entirely on them because they are the root cause of the overall problem. 

That’s why it’s good to learn how to…

Think like a Data Scientist.

Returning to the problem itself, let’s imagine that you are a tourist lost somewhere in the big museum, and you want to figure out where you are. What you do next is walk to the closest info map on the floor, which will show your current location. 

At this moment, in front of you, you see something like this: 

Data Science Process. Image by Author, inspired by Microsoft Learn

The next thing you might tell yourself is, “I want to get to Frida Kahlo’s painting.” (Note: These are the insights you want to get.)

Because your goal is to see this one painting that brought you miles away from your home and now sits two floors below, you head straight to the second floor. Beforehand, you memorized the shortest path to reach your goal. (Note: This is the initial data collection and discovery phase.)

However, along the way, you stumble upon some obstacles — the elevator is shut down for renovation, so you have to use the stairs. The museum paintings were reordered just two days ago, and the info plans didn’t reflect the changes, so the path you had in mind to get to the painting is not accurate. 

Then you find yourself wandering around the third floor already, asking quietly again, “How do I get out of this labyrinth and get to my painting faster?

While you don’t know the answer, you ask the museum staff on the third floor to help you out, and you start collecting the new data to get the correct route to your painting. (Note: This is a new data collection and discovery phase.)

Nonetheless, once you get to the second floor, you get lost again, but what you do next is start noticing a pattern in how the paintings have been ordered chronologically and thematically to group the artists whose styles overlap, thus giving you an indication of where to go to find your painting. (Note: This is a modelling phase overlapped with the enrichment phase from the dataset you collected during school days — your art knowledge.)

Finally, after adapting the pattern analysis and recalling the collected inputs on the museum route, you arrive in front of the painting you had been planning to see since booking your flight a few months ago. 

What I described now is how you approach data science and, nowadays, generative AI problems. You always start with the end goal in mind and ask yourself:

“What is the expected outcome I want or need to get from this?”

Then you start planning from this question backwards. The example above started with requesting holidays, booking flights, arranging accommodation, traveling to a destination, buying museum tickets, wandering around in a museum, and then seeing the painting you’ve been reading about for ages. 

Of course, there is more to it, and this process should be approached differently if you need to solve someone else’s problem, which is a bit more complex than locating the painting in the museum. 

In this case, you have to…

Ask the “good” questions.

To do this, let’s define what a good question means [1]: 

A good data science question must be concrete, tractable, and answerable. Your question works well if it naturally points to a feasible approach for your project. If your question is too vague to suggest what data you need, it won’t effectively guide your work.

Formulating good questions keeps you on track so you don’t get lost in the data that should be used to get to the specific problem solution, or you don’t end up solving the wrong problem.

Going into more detail, good questions will help identify gaps in reasoning, avoid faulty premises, and create alternative scenarios in case things do go south (which almost always happens)👇🏼.

Image created by Author after analyzing “Chapter 2. Setting goals by asking good questions” from “Think Like a Data Scientist” book [2]

From the above-presented diagram, you understand how good questions, first and foremost, need to support concrete assumptions. This means they need to be formulated in a way that your premises are clear and ensure they can be tested without mixing up facts with opinions.

Good questions produce answers that move you closer to your goal, whether through confirming hypotheses, providing new insights, or eliminating wrong paths. They are measurable, and with this, they connect to project goals because they are formulated with consideration of what’s possible, valuable, and efficient [2].

Good questions are answerable with available data, considering current data relevance and limitations. 

Last but not least, good questions anticipate obstacles. If something is certain in data science, this is the uncertainty, so having backup plans when things don’t work as expected is important to produce results for your project.

Let’s exemplify this with one use case of an airline company that has a challenge with increasing its fleet availability due to unplanned technical groundings (UTG).

These unexpected maintenance events disrupt flights and cost the company significant money. Because of this, executives decided to react to the problem and call in a data scientist (you) to help them improve aircraft availability.

Now, if this were the first data science task you ever got, you might start the investigation by asking:

“How can we eliminate all unplanned maintenance events?”

You understand how this question is an example of the wrong or “poor” one because:

  • It is not realistic: it lumps every possible defect, big and small, into one impossible goal of “zero operational interruptions”.
  • It doesn’t hold a measure of success: There’s no concrete metric to show progress, and if you’re not at zero, you’re at “failure.”
  • It is not data-driven: The question didn’t cover which data is recorded before delays occur, and how the aircraft unavailability is measured and reported from it.

So, instead of this vague question, you would probably ask a set of targeted questions:

  1. Which aircraft (sub)system is most critical to flight disruptions?
    (Concrete, specific, answerable) This question narrows down your scope, focusing on only one or two specific (sub) systems affecting most delays.
  2. What constitutes “critical downtime” from an operational perspective?
    (Valuable, ties to business goals) If the airline (or regulatory body) doesn’t define how many minutes of unscheduled downtime matter for schedule disruptions, you might waste effort solving less urgent issues.
  3. Which data sources capture the root causes, and how can we fuse them?
    (Manageable, narrows the scope of the project further) This clarifies which data sources one would need to find the problem solution.

With these sharper questions, you will drill down to the real problem:

  • Not all delays weigh the same in cost or impact. The “correct” data science problem is to predict critical subsystem failures that lead to operationally costly interruptions so maintenance crews can prioritize them.

That’s why…

Defining the problem determines every step after. 

It’s the foundation upon which your data, modelling, and evaluation phases are built 👇🏼.

Image created by Author after analyzing and overlapping different images from “Chapter 2. Setting goals by asking good questions, Think Like a Data Scientist” book [2]

It means you are clarifying the project’s objectives, constraints, and scope; you need to articulate the ultimate goal first and, except for asking “What’s the expected outcome I want or need to get from this?”, ask as well: 

What would success look like and how can we measure it?

From there, drill down to (possible) next-level questions that you (I) have learned from the Ivory Tower days:
 — History questions: “Has anyone tried to solve this before? What happened? What is still missing?”
 —  Context questions: “Who is affected by this problem and how? How are they partially resolving it now? Which sources, methods, and tools are they using now, and can they still be reused in the new models?”
 — Impact Questions: “What happens if we don’t solve this? What changes if we do? Is there a value we can create by default? How much will this approach cost?”
 — Assumption questions: “What are we taking for granted that might not be true (especially when it comes to data and stakeholders’ ideas)?”
 — ….

Then, do this in the loop and always “ask, ask again, and don’t stop asking” questions so you can drill down and understand which data and analysis are needed and what the ground problem is. 

This is the evergreen knowledge you can apply nowadays, too, when deciding whether your problem is of a predictive or generative nature.

(More about this in some other note where I will explain how problematic it is trying to solve the problem with the models that have never seen — or have never been trained on — similar problems before.)

Now, going back to memory lane…

I want to add one important note: I have learned from late nights in the Ivory Tower that no amount of data or data science knowledge can save you if you’re solving the wrong problem and trying to get the solution (answer) from a question that was simply wrong and vague. 

When you have a problem at hand, do not rush into assumptions or start building models without understanding what you need to do (festina lente: make haste slowly).

In addition, prepare yourself for unexpected situations and do a proper investigation with your stakeholders and domain experts because their patience will be limited, too. 

With this, I want to say that the “real art” of being successful in data projects is knowing precisely what the problem is, figuring out if it can be solved in the first place, and then coming up with the “how” part. 

You get there by learning to ask good questions.

To end this narrative, recall how Einstein famously said:  

If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.


Thank you for reading, and stay tuned for the next Ivory Tower note.

If you found this post valuable, feel free to share it with your network. 👏

Connect for more stories on Medium ✍ and LinkedIn 🖇.


References: 

[1] DS4Humans, Backwards Design, accessed: April 5th 2025, https://ds4humans.com/40_in_practice/05_backwards_design.html#defining-a-good-question

[2] Godsey, B. (2017), Think Like a Data Scientist: Tackle the data science process step-by-step, Manning Publications.

The post Ivory Tower Notes: The Problem appeared first on Towards Data Science.

]]>
Time Series Forecasting Made Simple (Part 1): Decomposition and Baseline Models https://towardsdatascience.com/time-series-forecasting-made-simple-part-1-decomposition-baseline-models/ Wed, 09 Apr 2025 19:53:52 +0000 https://towardsdatascience.com/?p=605699 Learn the intuition behind time series decomposition, additive vs. multiplicative models and build your first forecasting baseline model using Python

The post Time Series Forecasting Made Simple (Part 1): Decomposition and Baseline Models appeared first on Towards Data Science.

]]>
I used to avoid time series analysis. Every time I took an online course, I’d see a module titled “Time Series Analysis” with subtopics like Fourier Transforms, autocorrelation functions and other intimidating terms. I don’t know why, but I always found a reason to avoid it.

But here’s what I’ve learned: any complex topic becomes manageable when we start from the basics and focus on understanding the intuition behind it. That’s exactly what this blog series is about : making time series feel less like a maze and more like a conversation with your data over time.

We understand complex topics much more easily when they’re explained through real-world examples. That’s exactly how I’ll approach this series.

In each post, we’ll work with a simple dataset and explore what’s needed from a time series perspective. We’ll build intuition around each concept, understand why it matters, and implement it step by step on the data.

Time Series Analysis is the process of understanding, modeling and forecasting data that is observed over time. It involves identifying patterns such as trends, seasonality and noise in past observations to make informed predictions about future values.

Let’s start by considering a dataset named Daily Minimum Temperatures in Melbourne (open license). This dataset contains daily records of the lowest temperature (in Celsius) observed in Melbourne, Australia, over a 10-year period from 1981 to 1990. Each entry includes just two columns:

Date: The calendar day (from 1981-01-01 to 1990-12-31)
Temp: The minimum temperature recorded on that day

You’ve probably heard of models like ARIMA, SARIMA or Exponential Smoothing. But before we go there, it’s a good idea to try out some simple baseline models first, to see how well a basic approach performs on our data.

While there are many types of baseline models used in time series forecasting, here we’ll focus on the three most essential ones, which are simple, effective, and widely applicable across industries.

Naive Forecast: Assumes the next value will be the same as the last observed one.
Seasonal Naive Forecast: Assumes the value will repeat from the same point last season (e.g., last week or last month).
Moving Average: Takes the average of the last n points.
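To make these three ideas concrete before we go any further, here is a minimal sketch in pandas. The tiny synthetic series and the 7-day window are arbitrary choices made purely for illustration; later in this post we apply the Seasonal Naive idea to the real Melbourne data.

import numpy as np
import pandas as pd

# A small synthetic daily series, only to illustrate the three baselines
dates = pd.date_range("1981-01-01", periods=730, freq="D")
temp = pd.Series(10 + 5 * np.sin(2 * np.pi * dates.dayofyear / 365), index=dates)

naive_forecast = temp.shift(1)                            # yesterday's value
seasonal_naive = temp.shift(365)                          # same day last year
moving_average = temp.rolling(window=7).mean().shift(1)   # average of the previous 7 days

Each of these produces a forecast series aligned with the original index, so comparing a baseline against the actual values is just a matter of subtracting two columns.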

You might be wondering, why use baseline models at all? Why not just go straight to the well-known forecasting methods like ARIMA or SARIMA?

Let’s consider a shop owner who wants to forecast next month’s sales. By applying a moving average baseline model, they can estimate next month’s sales as the average of previous months. This simple approach might already deliver around 80% accuracy — good enough for planning and inventory decisions.

Now, if we switch to a more advanced model like ARIMA or SARIMA, we might increase accuracy to around 85%. But the key question is: is that extra 5% worth the additional time, effort and resources? In this case, the baseline model does the job.

In fact, in most everyday business scenarios, baseline models are sufficient. We typically turn to classical models like ARIMA or SARIMA in high-impact industries such as finance or energy, where even a small improvement in accuracy can have a significant financial or operational impact. Even then, a baseline model is usually applied first — not only to provide quick insights but also to act as a benchmark that more complex models must outperform.

Okay, now that we’re ready to implement some baseline models, there’s one key thing we need to understand first:
Every time series is made up of three main components — trend, seasonality and residuals.

Time series decomposition separates data into trend, seasonality and residuals (noise), helping us uncover the true patterns beneath the surface. This understanding guides the choice of forecasting models and improves accuracy. It’s also a vital first step before building both simple and advanced forecasting solutions.

Trend
This is the overall direction your data is moving in over time — going up, down or staying flat.
Example: Steady decrease in monthly cigarette sales.

Seasonality
These are the patterns that repeat at regular intervals — daily, weekly, monthly or yearly.
Example: Cool drinks sales in summer.

Residuals (Noise)
This is the random “leftover” part of the data, the unpredictable ups and downs that can’t be explained by trend or seasonality.
Example: A one-time car purchase showing up in your monthly expense pattern.

Now that we understand the key components of a time series, let’s put that into practice using a real dataset: Daily Minimum Temperatures in Melbourne, Australia.

We’ll use Python to decompose the time series into its trend, seasonality, and residual components so we can better understand its structure and choose an appropriate baseline model.

Code:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load the dataset
df = pd.read_csv("minimum daily temperatures data.csv")

# Convert 'Date' to datetime and set as index
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)

# Set a regular daily frequency and fill missing values using forward fill
df = df.asfreq('D')
df['Temp'].fillna(method='ffill', inplace=True)

# Decompose the daily series (365-day seasonality for yearly patterns)
decomposition = seasonal_decompose(df['Temp'], model='additive', period=365)

# Plot the decomposed components
decomposition.plot()
plt.suptitle('Decomposition of Daily Minimum Temperatures (Daily)', fontsize=14)
plt.tight_layout()
plt.show()

Output:

Decomposition of daily temperatures showing trend, seasonal cycles and random fluctuations.

The decomposition plot clearly shows a strong seasonal pattern that repeats each year, along with a mild trend that shifts over time. The residual component captures the random noise that isn’t explained by trend or seasonality.

In the code earlier, you might have noticed that I used an additive model for decomposing the time series. But what exactly does that mean, and why is it the right choice for this dataset?

Let’s break it down.
In an additive model, we assume that trend, seasonality and residuals (noise) simply add together, like this:
Y = T + S + R

Where:
Y is the actual value at time t
T is the trend component
S is the seasonal component
R is the residual (random noise)

This means we treat the observed value as the sum of its parts; each component contributes independently to the final output.

I chose the additive model because when I looked at the pattern in daily minimum temperatures, I noticed something important:

The line plot above shows the daily minimum temperatures from 1981 to 1990. We can clearly see a strong seasonal cycle that repeats each year, colder temperatures in winter, warmer in summer.

Importantly, the amplitude of these seasonal swings stays relatively consistent over the years. For example, the temperature difference between summer and winter doesn’t appear to grow or shrink over time. This stability in seasonal variation is a key sign that the additive model is appropriate for decomposition, since the seasonal component appears to be independent of any trend.

We use an additive model when the trend is relatively stable and does not amplify or distort the seasonal pattern, and when the seasonality stays within a consistent range over time, even if there are minor fluctuations.

Now that we understand how the additive model works, let's explore the multiplicative model, which is used when the seasonal effect scales with the trend. Seeing the contrast will also make the additive model clearer.

Consider a household’s electricity consumption. Suppose the household uses 20% more electricity in summer compared to winter. That means the seasonal effect isn’t a fixed number — it’s a proportion of their baseline usage.

Let’s see how this looks with real numbers:

In 2021, the household used 300 kWh in winter and 360 kWh in summer (20% more than winter).

In 2022, their winter consumption increased to 330 kWh, and summer usage rose to 396 kWh (still 20% more than winter).

In both years, the seasonal difference grows with the trend, from +60 kWh in 2021 to +66 kWh in 2022, even though the percentage increase stays the same. This is exactly the kind of behavior that a multiplicative model captures well.

In mathematical terms:
Y = T × S × R
Where:
Y​: Observed value
T: Trend component
S: Seasonal component
R​: Residual (noise)

By looking at the decomposition plot, we can figure out whether an additive or multiplicative model fits our data better.
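If you want to see the multiplicative variant in action, the only change in statsmodels is the model argument. One caveat: multiplicative decomposition requires strictly positive values, which makes it a better fit for series like sales or electricity usage than for temperatures that can touch zero. Here is a minimal sketch on a synthetic series; the electricity numbers are invented, simply echoing the household example above.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly electricity usage: a rising trend whose seasonal swing
# grows proportionally with the level, i.e. the multiplicative situation
months = pd.date_range("2018-01-01", periods=60, freq="MS")
trend = np.linspace(300, 450, 60)                             # baseline kWh, slowly rising
seasonal = 1 + 0.2 * np.sin(2 * np.pi * np.arange(60) / 12)   # roughly +/-20% yearly swing
usage = pd.Series(trend * seasonal, index=months)

decomposition = seasonal_decompose(usage, model='multiplicative', period=12)
decomposition.plot()
plt.show()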

There are also other powerful decomposition tools available, which I'll be covering in one of my upcoming blog posts.

Now that we have a clear understanding of additive and multiplicative models, let's shift our focus to applying a baseline model that fits this dataset.

Based on the decomposition plot, we can see a strong seasonal pattern in the data, which suggests that a Seasonal Naive model might be a good fit for this time series.

This model assumes that the value at a given time will be the same as it was in the same period of the previous season — making it a simple yet effective choice when seasonality is dominant and consistent. For example, if temperatures typically follow the same yearly cycle, then the forecast for July 1st, 1990, would simply be the temperature recorded on July 1st, 1989.

Code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
df = pd.read_csv("minimum daily temperatures data.csv")

# Convert 'Date' column to datetime and set as index
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)

# Ensure regular daily frequency and fill missing values
df = df.asfreq('D')
df['Temp'].fillna(method='ffill', inplace=True)

# Step 1: Create the Seasonal Naive Forecast
seasonal_period = 365  # Assuming yearly seasonality for daily data
# Create the Seasonal Naive forecast by shifting the temperature values by 365 days
df['Seasonal_Naive'] = df['Temp'].shift(seasonal_period)

# Step 2: Plot the actual vs forecasted values
# Plot the last 2 years (730 days) of data to compare
plt.figure(figsize=(12, 5))
plt.plot(df['Temp'][-730:], label='Actual')
plt.plot(df['Seasonal_Naive'][-730:], label='Seasonal Naive Forecast', linestyle='--')
plt.title('Seasonal Naive Forecast vs Actual Temperatures')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.tight_layout()
plt.show()

# Step 3: Evaluate using MAPE (Mean Absolute Percentage Error)
# Use the last 365 days for testing
test = df[['Temp', 'Seasonal_Naive']].iloc[-365:].copy()
test.dropna(inplace=True)

# MAPE Calculation
mape = np.mean(np.abs((test['Temp'] - test['Seasonal_Naive']) / test['Temp'])) * 100
print(f"MAPE (Seasonal Naive Forecast): {mape:.2f}%")

Output:

Seasonal Naive Forecast vs. Actual Temperatures (1989–1990)


To keep the visualization clear and focused, we’ve plotted the last two years of the dataset (1989–1990) instead of all 10 years.

This plot compares the actual daily minimum temperatures in Melbourne with the values predicted by the Seasonal Naive model, which simply assumes that each day’s temperature will be the same as it was on the same day one year ago.

As seen in the plot, the Seasonal Naive forecast captures the broad shape of the seasonal cycles quite well — it mirrors the rise and fall of temperatures throughout the year. However, it doesn’t capture day-to-day variations, nor does it respond to slight shifts in seasonal timing. This is expected, as the model is designed to repeat the previous year’s pattern exactly, without adjusting for trend or noise.

To evaluate how well this model performs, we calculate the Mean Absolute Percentage Error (MAPE) over the final 365 days of the dataset (i.e., 1990). We only use this period because the Seasonal Naive forecast needs a full year of historical data before it can begin making predictions.

Mean Absolute Percentage Error (MAPE) is a commonly used metric to evaluate the accuracy of forecasting models. It measures the average absolute difference between the actual and predicted values, expressed as a percentage of the actual values.
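Written as a formula, this is exactly what the np.mean(np.abs(...)) line in the code above computes:

\[ \textbf{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \]

where yₜ is the actual value, ŷₜ is the forecasted value and n is the number of days evaluated.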

In time series forecasting, we typically evaluate model performance on the most recent or target time period — not on the middle years. This reflects how forecasts are used in the real world: we build models on historical data to predict what’s coming next.

That’s why we calculate MAPE only on the final 365 days of the dataset — this simulates forecasting for a future and gives us a realistic measure of how well the model would perform in practice.

The result is a MAPE of 28.23%, which gives us a baseline level of forecasting error. Any model we build next, whether customized or more advanced, should aim to outperform this benchmark.

A MAPE of 28.23% means that, on average, the model’s predictions were 28.23% off from the actual daily temperature values over the last year.

In other words, if the true temperature on a given day was 10°C, the Seasonal Naïve forecast might have been around 7.2°C or 12.8°C, reflecting a 28% deviation.

I’ll dive deeper into evaluation metrics in a future post.

In this post, we laid the foundation for time series forecasting by understanding how real-world data can be broken down into trend, seasonality, and residuals through decomposition. We explored the difference between additive and multiplicative models, implemented the Seasonal Naive baseline forecast and evaluated its performance using MAPE.

While the Seasonal Naive model is simple and intuitive, it comes with limitations especially for this dataset. It assumes that the temperature on any given day is identical to the same day last year. But as the plot and MAPE of 28.23% showed, this assumption doesn’t hold perfectly. The data displays slight shifts in seasonal patterns and long-term variations that the model fails to capture.

In the next part of this series, we’ll go further. We’ll explore how to customize a baseline model, compare it to the Seasonal Naive approach and evaluate which one performs better using error metrics like MAPE, MAE and RMSE.

We’ll also begin building the foundation needed to understand more advanced models like ARIMA including key concepts such as:

  • Stationarity
  • Autocorrelation and Partial Autocorrelation 
  • Differencing
  • Lag-based modeling (AR and MA terms)

Part 2 will dive into these topics in more detail, starting with custom baselines and ending with the foundations of ARIMA.

Thanks for reading.  I hope you found this post helpful and insightful.

The post Time Series Forecasting Made Simple (Part 1): Decomposition and Baseline Models appeared first on Towards Data Science.

]]>
Mining Rules from Data https://towardsdatascience.com/mining-rules-from-data/ Wed, 09 Apr 2025 16:54:40 +0000 https://towardsdatascience.com/?p=605697 Using decision trees for quick segmentation

The post Mining Rules from Data appeared first on Towards Data Science.

]]>
Working with products, we might face a need to introduce some “rules”. Let me explain what I mean by “rules” in practical examples: 

  • Imagine that we’re seeing a massive wave of fraud in our product, and we want to restrict onboarding for a particular segment of customers to lower this risk. For example, we found out that the majority of fraudsters had specific user agents and IP addresses from certain countries. 
  • Another option is to send coupons to customers to use in our online shop. However, we would like to treat only customers who are likely to churn since loyal users will return to the product anyway. We might figure out that the most feasible group is customers who joined less than a year ago and decreased their spending by 30%+ last month. 
  • Transactional businesses often have a segment of customers on whom they are losing money. For example, a bank customer passed verification and regularly reached out to customer support (generating onboarding and servicing costs) while making almost no transactions (so generating no revenue). The bank might introduce a small monthly subscription fee for customers with less than $1,000 in their account, since they are likely non-profitable.

Of course, in all these cases, we might have used a complex Machine Learning model that would take into account all the factors and predict the probability (either of a customer being a fraudster or churning). Still, under some circumstances, we might prefer just a set of static rules for the following reasons:  

  • The speed and complexity of implementation. Deploying an ML model in production takes time and effort. If you are experiencing a fraud wave right now, it might be more feasible to go live with a set of static rules that can be implemented quickly and then work on a comprehensive solution. 
  • Interpretability. ML models are black boxes. Even though we might be able to understand at a high level how they work and what features are the most important ones, it’s challenging to explain them to customers. In the example of subscription fees for non-profitable customers, it’s important to share a set of transparent rules with customers so that they can understand the pricing. 
  • Compliance. Some industries, like finance or healthcare, might require auditable and rule-based decisions to meet compliance requirements.

In this article, I want to show you how we can solve business problems using such rules. We will take a practical example and go really deep into this topic:

  • we will discuss which models we can use to mine such rules from data,
  • we will build a Decision Tree Classifier from scratch to learn how it works,
  • we will fit the sklearn Decision Tree Classifier model to extract the rules from the data,
  • we will learn how to parse the Decision Tree structure to get the resulting segments,
  • finally, we will explore different options for category encoding, since the sklearn implementation doesn’t support categorical variables.

We have lots of topics to cover, so let’s jump into it.

Case

As usual, it’s easier to learn something with a practical example. So, let’s start by discussing the task we will be solving in this article. 

We will work with the Bank Marketing dataset (CC BY 4.0 license). This dataset contains data about the direct marketing campaigns of a Portuguese banking institution. For each customer, we know a bunch of features and whether they subscribed to a term deposit (our target). 

Our business goal is to maximise the number of conversions (subscriptions) with limited operational resources. So, we can’t call the whole user base, and we want to reach the best outcome with the resources we have.

The first step is to look at the data. So, let’s load the data set.

import pandas as pd
pd.set_option('display.max_colwidth', 5000)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('bank-full.csv', sep = ';')
df = df.drop(['duration', 'campaign'], axis = 1)
# removed columns related to the current marketing campaign, 
# since they introduce data leakage

df.head()

We know quite a lot about the customers, including personal data (such as job type or marital status) and their previous behaviour (such as whether they have a loan or their average yearly balance).

Image by author

The next step is to select a machine-learning model. There are two classes of models that are usually used when we need something easily interpretable:

  • decision trees,
  • linear or logistic regression.

Both options are feasible and can give us good models that can be easily implemented and interpreted. However, in this article, I would like to stick to the decision tree model because it produces actual rules, while logistic regression will give us probability as a weighted sum of features.

Data Preprocessing 

As we’ve seen in the data, there are lots of categorical variables (such as education or marital status). Unfortunately, the sklearn decision tree implementation can’t handle categorical data, so we need to do some preprocessing.

Let’s start by transforming yes/no flags into integers. 

for p in ['default', 'housing', 'loan', 'y']:
    df[p] = df[p].map(lambda x: 1 if x == 'yes' else 0)

The next step is to transform the month variable. We can use one-hot encoding for months, introducing flags like month_jan , month_feb , etc. However, there might be seasonal effects, and I think it would be more reasonable to convert months into integers following their order. 

month_map = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
# I saved 5 mins by asking ChatGPT to do this mapping

df['month'] = df.month.map(lambda x: month_map[x] if x in month_map else x)

For all other categorical variables, let’s use one-hot encoding. We will discuss different strategies for category encoding later, but for now, let’s stick to the default approach.

The easiest way to do one-hot encoding is to leverage get_dummies function in pandas.

fin_df = pd.get_dummies(
  df, columns=['job', 'marital', 'education', 'poutcome', 'contact'], 
  dtype = int, # to convert to flags 0/1
  drop_first = False # to keep all possible values
)

This function transforms each categorical variable into a separate 1/0 column for each possible value. We can see how it works for the poutcome column. 

# join the original poutcome column back by index (the raw dataset has no separate id column)
fin_df.merge(df[['poutcome']], left_index = True, right_index = True)\
    .groupby(['poutcome', 'poutcome_unknown', 'poutcome_failure', 
      'poutcome_other', 'poutcome_success'], as_index = False).y.count()\
    .rename(columns = {'y': 'cases'})\
    .sort_values('cases', ascending = False)
Image by author

Our data is now ready, and it’s time to discuss how decision tree classifiers work.

Decision Tree Classifier: Theory

In this section, we’ll explore the theory behind the Decision Tree Classifier and build the algorithm from scratch. If you’re more interested in a practical example, feel free to skip ahead to the next part.

The easiest way to understand the decision tree model is to look at an example. So, let's build a simple model based on our data. We will use DecisionTreeClassifier from sklearn.

import sklearn.tree

feature_names = fin_df.drop(['y'], axis = 1).columns
model = sklearn.tree.DecisionTreeClassifier(
  max_depth = 2, min_samples_leaf = 1000)
model.fit(fin_df[feature_names], fin_df['y'])

The next step is to visualise the tree.

import graphviz

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True, 
    proportion = True, precision = 2 
    # to show shares of classes instead of absolute numbers
)

graph = graphviz.Source(dot_data)
graph
Image by author

So, we can see that the model is straightforward. It’s a set of binary splits that we can use as heuristics. 

Let’s figure out how the classifier works under the hood. As usual, the best way to understand the model is to build the logic from scratch. 

The cornerstone of any optimisation problem is the objective function. By default, in the decision tree classifier, we're optimising the Gini coefficient. Imagine drawing two random items from the sample, one after the other: the Gini coefficient equals the probability that these items come from different classes. So, our goal will be minimising the Gini coefficient. 

In the case of just two classes (like in our example, where marketing intervention was either successful or not), the Gini coefficient is defined just by one parameter p , where p is the probability of getting an item from one of the classes. Here’s the formula:

\[\textbf{gini}(\textsf{p}) = 1 - \textsf{p}^2 - (1 - \textsf{p})^2 = 2 * \textsf{p} * (1 - \textsf{p}) \]

If our classification is ideal and we are able to separate the classes perfectly, then the Gini coefficient will be equal to 0. The worst-case scenario is when p = 0.5 , then the Gini coefficient is also equal to 0.5.

With the formula above, we can calculate the Gini coefficient for each leaf of the tree. To calculate the Gini coefficient for the whole tree, we need to combine the Gini coefficients of binary splits. For that, we can just get a weighted sum:

\[\textbf{gini}_{\textsf{total}} = \textbf{gini}_{\textsf{left}} * \frac{\textbf{n}_{\textsf{left}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}} + \textbf{gini}_{\textsf{right}} * \frac{\textbf{n}_{\textsf{right}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}}\]

Now that we know what value we’re optimising, we only need to define all possible binary splits, iterate through them and choose the best option. 

Defining all possible binary splits is also quite straightforward. We can do it one by one for each parameter, sort possible values, and pick up thresholds between them. For example, for months (integer from 1 to 12). 

Image by author

Let’s try to code it and see whether we will come to the same result. First, we will define functions that calculate the Gini coefficient for one dataset and the combination.

def get_gini(df):
    p = df.y.mean()
    return 2*p*(1-p)

print(get_gini(fin_df)) 
# 0.2065
# close to what we see at the root node of Decision Tree

def get_gini_comb(df1, df2):
    n1 = df1.shape[0]
    n2 = df2.shape[0]

    gini1 = get_gini(df1)
    gini2 = get_gini(df2)
    return (gini1*n1 + gini2*n2)/(n1 + n2)

The next step is to get all possible thresholds for one parameter and calculate their Gini coefficients. 

import tqdm
def optimise_one_parameter(df, param):
    tmp = []
    possible_values = list(sorted(df[param].unique()))
    print(param)

    for i in tqdm.tqdm(range(1, len(possible_values))): 
        threshold = (possible_values[i-1] + possible_values[i])/2
        gini = get_gini_comb(df[df[param] <= threshold], 
          df[df[param] > threshold])
        tmp.append(
            {'param': param, 
            'threshold': threshold, 
            'gini': gini, 
            'sizes': (df[df[param] <= threshold].shape[0], df[df[param] > threshold].shape[0])
            }
        )
    return pd.DataFrame(tmp)

The final step is to iterate through all features and calculate all possible splits. 

tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(fin_df, feature))
opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)
Image by author

Wonderful, we’ve got the same result as in our DecisionTreeClassifier model. The optimal split is whether poutcome = success or not. We’ve reduced the Gini coefficient from 0.2065 to 0.1872. 

To continue building the tree, we need to repeat the process recursively. For example, going down for the poutcome_success <= 0.5 branch:

tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(
      fin_df[fin_df.poutcome_success <= 0.5], feature))

opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)
Image by author

The only question we still need to discuss is the stopping criteria. In our initial example, we’ve used two conditions:

  • max_depth = 2 — it just limits the maximum depth of the tree, 
  • min_samples_leaf = 1000 prevents us from getting leaf nodes with less than 1K samples. Because of this condition, we’ve chosen a binary split by contact_unknown even though age led to a lower Gini coefficient.

Also, I usually limit min_impurity_decrease, which prevents us from splitting further if the gains are too small. By gains, we mean the decrease of the Gini coefficient.

So, we’ve understood how the Decision Tree Classifier works, and now it’s time to use it in practice.

If you're interested in seeing how the Decision Tree Regressor works in detail, you can look it up in my previous article.

Decision Trees: practice

We've already built a simple tree model with two layers, but it's definitely not enough since it's too simple to capture all the insights from the data. Let's train another Decision Tree, this time limiting only the minimum number of samples in leaves and the minimum impurity decrease (reduction of the Gini coefficient). 

model = sklearn.tree.DecisionTreeClassifier(
  min_samples_leaf = 1000, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True, 
    proportion = True, precision=2, impurity = True)

graph = graphviz.Source(dot_data)

# saving graph to png file
png_bytes = graph.pipe(format='png')
with open('decision_tree.png','wb') as f:
    f.write(png_bytes)
Image by author

That’s it. We’ve got our rules to split customers into groups (leaves). Now, we can iterate through groups and see which groups of customers we want to contact. Even though our model is relatively small, it’s daunting to copy all conditions from the image. Luckily, we can parse the tree structure and get all the groups from the model.

The Decision Tree classifier has an attribute tree_ that gives us access to low-level attributes of the tree, such as node_count.

n_nodes = model.tree_.node_count
print(n_nodes)
# 13

The tree_ variable also stores the entire tree structure as parallel arrays, where the i-th element of each array stores the information about node i. For the root node, i equals 0.

Here are the arrays we have to represent the tree structure: 

  • children_left and children_right — IDs of left and right nodes, respectively; if the node is a leaf, then -1.
  • feature — feature used to split the node i .
  • threshold — threshold value used for the binary split of the node i .
  • n_node_samples — number of training samples that reached the node i .
  • values — shares of samples from each class.

Let’s save all these arrays. 

children_left = model.tree_.children_left
# [ 1,  2,  3,  4,  5,  6, -1, -1, -1, -1, -1, -1, -1]
children_right = model.tree_.children_right
# [12, 11, 10,  9,  8,  7, -1, -1, -1, -1, -1, -1, -1]
features = model.tree_.feature
# [30, 34,  0,  3,  6,  6, -2, -2, -2, -2, -2, -2, -2]
thresholds = model.tree_.threshold
# [ 0.5,  0.5, 59.5,  0.5,  6.5,  2.5, -2. , -2. , -2. , -2. , -2. , -2. , -2. ]
num_nodes = model.tree_.n_node_samples
# [45211, 43700, 30692, 29328, 14165,  4165,  2053,  2112, 10000, 
#  15163,  1364, 13008,  1511] 
values = model.tree_.value
# [[[0.8830152 , 0.1169848 ]],
# [[0.90135011, 0.09864989]],
# [[0.87671054, 0.12328946]],
# [[0.88550191, 0.11449809]],
# [[0.8530886 , 0.1469114 ]],
# [[0.76686675, 0.23313325]],
# [[0.87043351, 0.12956649]],
# [[0.66619318, 0.33380682]],
# [[0.889     , 0.111     ]],
# [[0.91578184, 0.08421816]],
# [[0.68768328, 0.31231672]],
# [[0.95948647, 0.04051353]],
# [[0.35274653, 0.64725347]]]

It will be more convenient for us to work with a hierarchical view of the tree structure, so let’s iterate through all nodes and, for each node, save the parent node ID and whether it was a right or left branch. 

hierarchy = {}

for node_id in range(n_nodes):
  if children_left[node_id] != -1: 
    hierarchy[children_left[node_id]] = {
      'parent': node_id, 
      'condition': 'left'
    }
  
  if children_right[node_id] != -1:
      hierarchy[children_right[node_id]] = {
       'parent': node_id, 
       'condition': 'right'
  }

print(hierarchy)
# {1: {'parent': 0, 'condition': 'left'},
# 12: {'parent': 0, 'condition': 'right'},
# 2: {'parent': 1, 'condition': 'left'},
# 11: {'parent': 1, 'condition': 'right'},
# 3: {'parent': 2, 'condition': 'left'},
# 10: {'parent': 2, 'condition': 'right'},
# 4: {'parent': 3, 'condition': 'left'},
# 9: {'parent': 3, 'condition': 'right'},
# 5: {'parent': 4, 'condition': 'left'},
# 8: {'parent': 4, 'condition': 'right'},
# 6: {'parent': 5, 'condition': 'left'},
# 7: {'parent': 5, 'condition': 'right'}}

The next step is to pick out the leaf nodes, since they are terminal and the most interesting for us, as they define the customer segments. 

leaves = []
for node_id in range(n_nodes):
    if (children_left[node_id] == -1) and (children_right[node_id] == -1):
        leaves.append(node_id)
print(leaves)
# [6, 7, 8, 9, 10, 11, 12]
leaves_df = pd.DataFrame({'node_id': leaves})

The next step is to determine all the conditions applied to each group since they will define our customer segments. The first function get_condition will give us the tuple of feature, condition type and threshold for a node. 

def get_condition(node_id, condition, features, thresholds, feature_names):
    # print(node_id, condition)
    feature = feature_names[features[node_id]]
    threshold = thresholds[node_id]
    cond = '>' if condition == 'right'  else '<='
    return (feature, cond, threshold)

print(get_condition(0, 'left', features, thresholds, feature_names)) 
# ('poutcome_success', '<=', 0.5)

print(get_condition(0, 'right', features, thresholds, feature_names))
# ('poutcome_success', '>', 0.5)

The next function will allow us to recursively go from the leaf node to the root and get all the binary splits. 

def get_decision_path_rec(node_id, decision_path, hierarchy):
  if node_id == 0:
    yield decision_path 
  else:
    parent_id = hierarchy[node_id]['parent']
    condition = hierarchy[node_id]['condition']
    for res in get_decision_path_rec(parent_id, decision_path + [(parent_id, condition)], hierarchy):
        yield res

decision_path = list(get_decision_path_rec(12, [], hierarchy))[0]
print(decision_path) 
# [(0, 'right')]

fmt_decision_path = list(map(
  lambda x: get_condition(x[0], x[1], features, thresholds, feature_names), 
  decision_path))
print(fmt_decision_path)
# [('poutcome_success', '>', 0.5)]

Let’s save the logic of executing the recursion and formatting into a wrapper function.

def get_decision_path(node_id, features, thresholds, hierarchy, feature_names):
  decision_path = list(get_decision_path_rec(node_id, [], hierarchy))[0]
  return list(map(lambda x: get_condition(x[0], x[1], features, thresholds, 
    feature_names), decision_path))

We’ve learned how to get each node’s binary split conditions. The only remaining logic is to combine the conditions. 

def get_decision_path_string(node_id, features, thresholds, hierarchy, 
  feature_names):
  conditions_df = pd.DataFrame(get_decision_path(node_id, features, thresholds, hierarchy, feature_names))
  conditions_df.columns = ['feature', 'condition', 'threshold']

  left_conditions_df = conditions_df[conditions_df.condition == '<=']
  right_conditions_df = conditions_df[conditions_df.condition == '>']

  # deduplication 
  left_conditions_df = left_conditions_df.groupby(['feature', 'condition'], as_index = False).min()
  right_conditions_df = right_conditions_df.groupby(['feature', 'condition'], as_index = False).max()
  
  # concatenation
  fin_conditions_df = pd.concat([left_conditions_df, right_conditions_df])\
      .sort_values(['feature', 'condition'], ascending = False)
  
  # formatting 
  fin_conditions_df['cond_string'] = list(map(
      lambda x, y, z: '(%s %s %.2f)' % (x, y, z),
      fin_conditions_df.feature,
      fin_conditions_df.condition,
      fin_conditions_df.threshold
  ))
  return ' and '.join(fin_conditions_df.cond_string.values)

print(get_decision_path_string(12, features, thresholds, hierarchy, 
  feature_names))
# (poutcome_success > 0.50)

Now, we can calculate the conditions for each group. 

leaves_df['condition'] = leaves_df['node_id'].map(
  lambda x: get_decision_path_string(x, features, thresholds, hierarchy, 
  feature_names)
)

The last step is to add their size and conversion to the groups.

leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total)\
  .map(lambda x: int(round(x/100)))
leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()

Now, we can use these rules to make decisions. We can sort groups by conversion (probability of successful contact) and pick the customers with the highest probability. 

leaves_df.sort_values('conversion', ascending = False)\
  .drop('node_id', axis = 1).set_index('condition')
Image by author

Imagine we have the resources to contact only around 10% of our user base; in that case, we can focus on the first three groups. Even with such a limited capacity, we would expect to get almost 40% conversion — it's a really good result, and we've achieved it with just a bunch of straightforward heuristics.  

In real life, it's also worth testing the model (or heuristics) before deploying it in production. I would split the training dataset into training and validation parts (by time, to avoid leakage) and check the heuristics' performance on the validation set to get a better view of the actual model quality.
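A minimal sketch of such a check could look like the following; the contact_date column is hypothetical (the dataset only has month and day fields), and the selected condition is just the top rule from our tree:

# hypothetical time-based split: the last 20% of the timeline becomes the validation set
df_sorted = fin_df.sort_values('contact_date')  # 'contact_date' is an assumed column
split_idx = int(len(df_sorted) * 0.8)
train_part, valid_part = df_sorted.iloc[:split_idx], df_sorted.iloc[split_idx:]

# apply an extracted heuristic to the validation part and check its conversion
selected = valid_part[valid_part['poutcome_success'] > 0.5]
print(selected['y'].mean())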

Working with high cardinality categories

Another topic worth discussing in this context is category encoding, since we have to encode categorical variables for the sklearn implementation. We've used a straightforward approach with one-hot encoding, but in some cases it doesn't work.

Imagine we also have a region in the data. I’ve synthetically generated English cities for each row. We have 155 unique regions, so the number of features has increased to 190. 

model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 100, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

So, the basic tree now has lots of conditions based on regions, and it's not convenient to work with them.

Image by author

In such a case, it might not be meaningful to explode the number of features, and it’s time to think about encoding. There’s a comprehensive article, “Categorically: Don’t explode — encode!”, that shares a bunch of different options to handle high cardinality categorical variables. I think the most feasible ones in our case will be the following two options:

  • Count or Frequency Encoder, which shows good performance in benchmarks. This encoding assumes that categories of similar size would have similar characteristics. 
  • Target Encoder, where we encode the category by the mean value of the target variable. It will allow us to prioritise segments with higher conversion and deprioritise segments with lower conversion. Ideally, it would be nice to use historical data to get the averages for the encoding, but we will use the existing dataset. 

However, it will be interesting to test different approaches, so let’s split our dataset into train and test, saving 10% for validation. For simplicity, I’ve used one-hot encoding for all columns except for region (since it has the highest cardinality).

from sklearn.model_selection import train_test_split
fin_df = pd.get_dummies(df, columns=['job', 'marital', 'education', 
  'poutcome', 'contact'], dtype = int, drop_first = False)
train_df, test_df = train_test_split(fin_df,test_size=0.1, random_state=42)
print(train_df.shape[0], test_df.shape[0])
# (40689, 4522)

For convenience, let’s combine all the logic for parsing the tree into one function.

def get_model_definition(model, feature_names):
  n_nodes = model.tree_.node_count
  children_left = model.tree_.children_left
  children_right = model.tree_.children_right
  features = model.tree_.feature
  thresholds = model.tree_.threshold
  num_nodes = model.tree_.n_node_samples
  values = model.tree_.value

  hierarchy = {}

  for node_id in range(n_nodes):
      if children_left[node_id] != -1: 
          hierarchy[children_left[node_id]] = {
            'parent': node_id, 
            'condition': 'left'
          }
    
      if children_right[node_id] != -1:
            hierarchy[children_right[node_id]] = {
             'parent': node_id, 
             'condition': 'right'
            }

  leaves = []
  for node_id in range(n_nodes):
      if (children_left[node_id] == -1) and (children_right[node_id] == -1):
          leaves.append(node_id)
  leaves_df = pd.DataFrame({'node_id': leaves})
  leaves_df['condition'] = leaves_df['node_id'].map(
    lambda x: get_decision_path_string(x, features, thresholds, hierarchy, feature_names)
  )

  leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
  leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
  leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total).map(lambda x: int(round(x/100)))
  leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
  leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()
  leaves_df = leaves_df.sort_values('conversion', ascending = False)\
    .drop('node_id', axis = 1).set_index('condition')
  leaves_df['cum_share_of_total'] = leaves_df['share_of_total'].cumsum()
  leaves_df['cum_share_of_converted'] = leaves_df['share_of_converted'].cumsum()
  return leaves_df

Let’s create an encodings data frame, calculating frequencies and conversions. 

region_encoding_df = train_df.groupby('region', as_index = False)\
  .aggregate({'id': 'count', 'y': 'mean'}).rename(columns = 
    {'id': 'region_count', 'y': 'region_target'})

Then, merge it into our training and validation sets. For the validation set, we will also fill NAs with the averages.

train_df = train_df.merge(region_encoding_df, on = 'region')

test_df = test_df.merge(region_encoding_df, on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
  .fillna(region_encoding_df.region_target.mean())
test_df['region_count'] = test_df['region_count']\
  .fillna(region_encoding_df.region_count.mean())

Now, we can fit the models and get their structures.

count_feature_names = train_df.drop(
  ['y', 'id', 'region_target', 'region'], axis = 1).columns
target_feature_names = train_df.drop(
  ['y', 'id', 'region_count', 'region'], axis = 1).columns
print(len(count_feature_names), len(target_feature_names))
# (36, 36)

count_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
count_model.fit(train_df[count_feature_names], train_df['y'])

target_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
target_model.fit(train_df[target_feature_names], train_df['y'])

count_model_def_df = get_model_definition(count_model, count_feature_names)
target_model_def_df = get_model_definition(target_model, target_feature_names)

Let’s look at the structures and select the top categories up to 10–15% of our target audience. We can also apply these conditions to our validation sets to test our approach in practice. 

Let’s start with Count Encoder. 

Image by author
count_selected_df = test_df[
    (test_df.poutcome_success > 0.50) | 
    ((test_df.poutcome_success <= 0.50) & (test_df.age > 60.50)) | 
    ((test_df.region_count > 3645.50) & (test_df.region_count <= 8151.50) & 
         (test_df.poutcome_success <= 0.50) & (test_df.contact_cellular > 0.50) & (test_df.age <= 60.50))
]

print(count_selected_df.shape[0], count_selected_df.y.sum())
# (508, 227)

We can also see what regions have been selected, and it’s only Manchester.

Image by author

Let’s continue with the Target encoding. 

Image by author
target_selected_df = test_df[
    ((test_df.region_target > 0.21) & (test_df.poutcome_success > 0.50)) | 
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month <= 6.50) & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month > 8.50) & (test_df.housing <= 0.50) 
         & (test_df.contact_unknown <= 0.50)) |
    ((test_df.region_target <= 0.21) & (test_df.poutcome_success > 0.50)) |
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month > 6.50) & (test_df.month <= 8.50) 
         & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50))
]

print(target_selected_df.shape[0], target_selected_df.y.sum())
# (502, 248)

We see a slightly lower number of selected users for communication but a significantly higher number of conversions: 248 vs. 227 (+9.3%).

Let's also look at the selected categories. We see that the model picked up all the cities with high conversions (Manchester, Liverpool, Bristol, Leicester, and Newcastle), but there are also many small regions with high conversions solely due to chance.

region_encoding_df[region_encoding_df.region_target > 0.21]\
  .sort_values('region_count', ascending = False)
Image by author

In our case, it doesn't impact much since the share of such small cities is low. However, if you have way more small categories, you might see significant overfitting. Target Encoding might be tricky at this point, so it's worth keeping an eye on the output of your model. 

Luckily, there’s an approach that can help you overcome this issue. Following the article “Encoding Categorical Variables: A Deep Dive into Target Encoding”, we can add smoothing. The idea is to combine the group’s conversion rate with the overall average: the larger the group, the more weight its data carries, while smaller segments will lean more towards the global average.

First, I’ve selected the parameters that make sense for our distribution, looking at a bunch of options. I chose to use the global average for the groups under 100 people. This part is a bit subjective, so use common sense and your knowledge about the business domain.

import numpy as np
import matplotlib.pyplot as plt

global_mean = train_df.y.mean()

k = 100
f = 10
smooth_df = pd.DataFrame({'region_count':np.arange(1, 100001, 1) })
smooth_df['smoothing'] = (1 / (1 + np.exp(-(smooth_df.region_count - k) / f)))

ax = plt.scatter(smooth_df.region_count, smooth_df.smoothing)
plt.xscale('log')
plt.ylim([-.1, 1.1])
plt.title('Smoothing')
Image by author

Then, we can calculate, based on the selected parameters, the smoothing coefficients and blended averages.

# keep the raw conversion rate under a separate name before blending it with the global mean
region_encoding_df = region_encoding_df.rename(columns = {'region_target': 'raw_region_target'})
region_encoding_df['smoothing'] = (1 / (1 + np.exp(-(region_encoding_df.region_count - k) / f)))
region_encoding_df['region_target'] = region_encoding_df.smoothing * region_encoding_df.raw_region_target \
    + (1 - region_encoding_df.smoothing) * global_mean

Then, we can fit another model with smoothed target category encoding.

# drop the previously merged raw encoding before merging the smoothed version
train_df = train_df.drop(columns = ['region_target'])\
  .merge(region_encoding_df[['region', 'region_target']], on = 'region')
test_df = test_df.drop(columns = ['region_target'])\
  .merge(region_encoding_df[['region', 'region_target']], on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
  .fillna(region_encoding_df.region_target.mean())

target_v2_feature_names = train_df.drop(['y', 'id', 'region'], axis = 1)\
  .columns

target_v2_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
target_v2_model.fit(train_df[target_v2_feature_names], train_df['y'])
target_v2_model_def_df = get_model_definition(target_v2_model, 
  target_v2_feature_names)
Image by author
target_v2_selected_df = test_df[
    ((test_df.region_target > 0.12) & (test_df.poutcome_success > 0.50)) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month <= 6.50) & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month > 8.50) & (test_df.housing <= 0.50) 
         & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target <= 0.12) & (test_df.poutcome_success > 0.50) ) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month > 6.50) & (test_df.month <= 8.50) 
         & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50) )
]

target_v2_selected_df.shape[0], target_v2_selected_df.y.sum()
# (500, 247)

We can see that we’ve eliminated the small cities and prevented overfitting in our model while keeping roughly the same performance, capturing 247 conversions.

region_encoding_df[region_encoding_df.region_target > 0.12]
Image by author

You can also use TargetEncoder from sklearn, which smooths and blends the category and global means depending on the segment size. However, it also adds randomness through internal cross-fitting, which is not ideal for our case of static heuristics.
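For reference, a minimal sketch of that built-in encoder (available in scikit-learn 1.3+) is shown below; the variability comes from the cross-fitting that fit_transform performs on the training data.

from sklearn.preprocessing import TargetEncoder

encoder = TargetEncoder(smooth='auto', random_state=42)

# cross-fitted encodings for the training set, plain encodings for the test set
train_region_encoded = encoder.fit_transform(train_df[['region']], train_df['y'])
test_region_encoded = encoder.transform(test_df[['region']])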

You can find the full code on GitHub.

Summary

In this article, we explored how to extract simple "rules" from data and use them to inform business decisions. We generated heuristics using a Decision Tree Classifier and touched on the important topic of categorical encoding, since sklearn's decision tree implementation requires categorical variables to be converted into numeric features.

We saw that this rule-based approach can be surprisingly effective, helping you reach business decisions quickly. However, it’s worth noting that this simplistic approach has its drawbacks:

  • We are trading off the model’s power and accuracy for its simplicity and interpretability, so if you’re optimising for accuracy, choose another approach.
  • Even though we're using a set of static heuristics, your data can still change, and the heuristics might become outdated, so you need to recheck your model from time to time. 

Thank you for reading this article. I hope it was insightful. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

Dataset: Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306

The post Mining Rules from Data appeared first on Towards Data Science.

]]>
A Data Scientist’s Guide to Docker Containers https://towardsdatascience.com/a-data-scientists-guide-to-docker-containers/ Tue, 08 Apr 2025 20:02:45 +0000 https://towardsdatascience.com/?p=605692 How to enable your ML model to run anywhere

The post A Data Scientist’s Guide to Docker Containers appeared first on Towards Data Science.

]]>
For an ML model to be useful, it needs to run somewhere. This somewhere is most likely not your local machine. A not-so-good model that runs in a production environment is better than a perfect model that never leaves your local machine.

However, the production machine is usually different from the one you developed the model on. So, you ship the model to the production machine, but somehow the model doesn’t work anymore. That’s weird, right? You tested everything on your local machine and it worked fine. You even wrote unit tests.

What happened? Most likely, the production machine differs from your local machine. Perhaps it does not have all the dependencies needed to run your model installed. Perhaps the installed dependencies are of a different version. There can be many reasons for this.

How can you solve this problem? One approach could be to exactly replicate the production machine. But that is very inflexible, as you would need to build a local replica for each new production machine.

A much nicer approach is to use Docker containers.

Docker is a tool that helps us to create, manage, and run code and applications in containers. A container is a small isolated computing environment in which we can package an application with all its dependencies — in our case, our ML model with all the libraries it needs to run. With this, we do not need to rely on what is installed on the host machine. A Docker Container enables us to separate applications from the underlying infrastructure.

For example, we package our ML model locally and push it to the cloud. With this, Docker helps us to ensure that our model can run anywhere and anytime. Using Docker has several advantages for us. It helps us to deliver new models faster, improve reproducibility, and make collaboration easier. All because we have exactly the same dependencies no matter where we run the container.

As Docker is widely used in the industry, Data Scientists need to be able to build and run containers using it. Hence, in this article, I will go through the basic concept of containers. I will show you all you need to know about Docker to get started. After we have covered the theory, I will show you how you can build and run your own Docker container.


What is a container?

A container is a small, isolated environment in which everything is self-contained. The environment packages up all code and dependencies.

A container has five main features.

  1. self-contained: A container isolates the application/software from its environment/infrastructure. Due to this isolation, we do not need to rely on any pre-installed dependencies on the host machine. Everything we need is part of the container. This ensures that the application can always run regardless of the infrastructure.
  2. isolated: The container has a minimal influence on the host and other containers and vice versa.
  3. independent: We can manage containers independently. Deleting a container does not affect other containers.
  4. portable: As a container isolates the software from the hardware, we can run it seamlessly on any machine. With this, we can move it between machines without a problem.
  5. lightweight: Containers are lightweight as they share the host machine’s OS. As they do not require their own OS, we do not need to partition the hardware resource of the host machine.

This might sound similar to virtual machines. But there is one big difference. The difference is in how they use their host computer’s resources. Virtual machines are an abstraction of the physical hardware. They partition one server into multiple. Thus, a VM includes a full copy of the OS which takes up more space.

In contrast, containers are an abstraction at the application layer. All containers share the host’s OS but run in isolated processes. Because containers do not contain an OS, they are more efficient in using the underlying system and resources by reducing overhead.

Containers vs. Virtual Machines (Image by the author based on docker.com)

Now we know what containers are. Let’s get some high-level understanding of how Docker works. I will briefly introduce the technical terms that are used often.


What is Docker?

To understand how Docker works, let’s have a brief look at its architecture.

Docker uses a client-server architecture containing three main parts: A Docker client, a Docker daemon (server), and a Docker registry.

The Docker client is the primary way to interact with Docker through commands. We use the client to communicate through a REST API with as many Docker daemons as we want. Often used commands are docker run, docker build, docker pull, and docker push. I will explain later what they do.

The Docker daemon manages Docker objects, such as images and containers. The daemon listens for Docker API requests. Depending on the request the daemon builds, runs, and distributes Docker containers. The Docker daemon and client can run on the same or different systems.

The Docker registry is a centralized location that stores and manages Docker images. We can use them to share images and make them accessible to others.

Sounds a bit abstract? No worries, once we get started it will be more intuitive. But before that, let’s run through the needed steps to create a Docker container.

Docker Architecture (Image by author based on docker.com)

What do we need to create a Docker container?

It is simple. We only need to do three steps:

  1. create a Dockerfile
  2. build a Docker Image from the Dockerfile
  3. run the Docker Image to create a Docker container

Let’s go step-by-step.

A Dockerfile is a text file that contains instructions on how to build a Docker Image. In the Dockerfile we define what the application looks like and its dependencies. We also state what process should run when launching the Docker container. The Dockerfile is composed of layers, representing a portion of the image’s file system. Each layer either adds, removes, or modifies the layer below it.

Based on the Dockerfile we create a Docker Image. The image is a read-only template with instructions to run a Docker container. Images are immutable. Once we create a Docker Image we cannot modify it anymore. If we want to make changes, we can only add changes on top of existing images or create a new image. When we rebuild an image, Docker is clever enough to rebuild only layers that have changed, reducing the build time.

A Docker Container is a runnable instance of a Docker Image. The container is defined by the image and any configuration options that we provide when creating or starting the container. When we remove a container all changes to its internal states are also removed if they are not stored in a persistent storage.


Using Docker: An example

With all the theory, let’s get our hands dirty and put everything together.

As an example, we will package a simple ML model with Flask in a Docker container. We can then run requests against the container and receive predictions in return. We will train a model locally and only load the artifacts of the trained model in the Docker Container.

I will go through the general workflow needed to create and run a Docker container with your ML model. I will guide you through the following steps:

  1. build model
  2. create requirements.txt file containing all dependencies
  3. create Dockerfile
  4. build docker image
  5. run container

Before we get started, we need to install Docker Desktop. We will use it to view and run our Docker containers later on. 

1. Build a model

First, we will train a simple RandomForestClassifier on scikit-learn’s Iris dataset and then store the trained model.
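Here is a minimal sketch of what that training step might look like; exporting the model to ONNX with skl2onnx and storing it as model.onnx are assumptions on my side, in line with the ONNX session used in the serving script.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import to_onnx

# train a simple classifier on the Iris dataset
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# export the trained model to ONNX so the container only needs the ONNX runtime
onx = to_onnx(model, X[:1].astype(np.float32))
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())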

Second, we build a script making our model available through a Rest API, using Flask. The script is also simple and contains three main steps:

  1. extract and convert the data we want to pass into the model from the payload JSON
  2. load the model artifacts and create an onnx session and run the model
  3. return the model’s predictions as json

I took most of the code from here and here and made only minor changes.
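A minimal sketch of such a Flask app could look as follows; the payload field name data and the model.onnx file name are assumptions, while the /invocations endpoint and port 8080 mirror the request we will run later.

import numpy as np
import onnxruntime as rt
from flask import Flask, request, jsonify

app = Flask(__name__)

# load the model artifact once at startup
session = rt.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

@app.route("/invocations", methods=["POST"])
def invocations():
    # 1. extract and convert the data from the JSON payload
    payload = request.get_json()
    data = np.array(payload["data"], dtype=np.float32)

    # 2. run the ONNX session on the incoming data
    predictions = session.run(None, {input_name: data})[0]

    # 3. return the model's predictions as JSON
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)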

2. Create requirements

Once we have created the Python file we want to execute when the Docker container is running, we must create a requirements.txt file containing all dependencies. In our case, it includes Flask, which we use to expose the model, and the ONNX runtime we use to run it.

3. Create Dockerfile

The last thing we need to prepare before being able to build a Docker Image and run a Docker container is to write a Dockerfile.

The Dockerfile contains all the instructions needed to build the Docker Image. The most common instructions are

  • FROM <image> — this specifies the base image that the build will extend.
  • WORKDIR <path> — this instruction specifies the “working directory” or the path in the image where files will be copied and commands will be executed.
  • COPY <host-path><image-path> — this instruction tells the builder to copy files from the host and put them into the container image.
  • RUN <command> — this instruction tells the builder to run the specified command.
  • ENV <name><value> — this instruction sets an environment variable that a running container will use.
  • EXPOSE <port-number> — this instruction sets the configuration on the image that indicates a port the image would like to expose.
  • USER <user-or-uid> — this instruction sets the default user for all subsequent instructions.
  • CMD ["<command>", "<arg1>"] — this instruction sets the default command a container using this image will run.

With these, we can create the Dockerfile for our example. We need to follow the following steps:

  1. Determine the base image
  2. Install application dependencies
  3. Copy in any relevant source code and/or binaries
  4. Configure the final image

Let’s go through them step by step. Each of these steps results in a layer in the Docker Image.

First, we specify the base image that we then build upon. As we have written the example in Python, we will use a Python base image.

Second, we set the working directory into which we will copy all the files we need to be able to run our ML model.

Third, we refresh the package index files to ensure that we have the latest available information about packages and their versions.

Fourth, we copy in and install the application dependencies.

Fifth, we copy in the source code and all other files we need. Here, we also expose port 8080, which we will use for interacting with the ML model.

Sixth, we set a user so that the container does not run as the root user.

Seventh, we define that the example.py file will be executed when we run the Docker container. With this, we create the Flask server to run our requests against.
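Putting the seven steps together, a Dockerfile along these lines would do the job; the exact base image, file names, and user ID are assumptions on my side.

# 1. base image
FROM python:3.12-slim

# 2. working directory inside the image
WORKDIR /app

# 3. refresh the package index files
RUN apt-get update && rm -rf /var/lib/apt/lists/*

# 4. copy in and install the application dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 5. copy in the source code and model artifacts, and expose port 8080
COPY example.py model.onnx ./
EXPOSE 8080

# 6. do not run the container as the root user
USER 1001

# 7. command executed when the container starts
CMD ["python", "example.py"]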

Besides creating the Dockerfile, we can also create a .dockerignore file to improve the build speed. Similar to a .gitignore file, we can exclude directories from the build context.

If you want to know more, please go to docker.com.

4. Create Docker Image

After creating all these files, we are ready to build the Docker Image.

To build the image we first need to open Docker Desktop. You can check if Docker Desktop is running by running docker ps in the command line. This command shows you all running containers.

To build a Docker Image, we need to be at the same level as our Dockerfile and requirements.txt file. We can then run docker build -t our_first_image . The -t flag indicates the name of the image, i.e., our_first_image, and the . tells Docker to build from the current directory.

Once we built the image we can do several things. We can

  • view the image by running docker image ls
  • view the history or how the image was created by running docker image history <image_name>
  • push the image to a registry by running docker push <image_name>

5. Run Docker Container

Once we have built the Docker Image, we can run our ML model in a container.

For this, we only need to execute docker run -p 8080:8080 <image_name> in the command line. With -p 8080:8080 we map the local port (8080) to the port in the container (8080).

If the Docker Image doesn’t expose a port, we could simply run docker run <image_name>. Instead of using the image_name, we can also use the image_id.

Okay, once the container is running, let's run a request against it. For this, we will send a payload to the endpoint by running curl -X POST http://localhost:8080/invocations -H "Content-Type: application/json" -d @./path/to/sample_payload.json


Conclusion

In this article, I showed you the basics of Docker Containers, what they are, and how to build them yourself. Although I only scratched the surface, it should be enough to get you started and able to package your next model. With this knowledge, you should be able to avoid the "it works on my machine" problems.

I hope that you find this article useful and that it will help you become a better Data Scientist.

See you in my next article and/or leave a comment.

The post A Data Scientist’s Guide to Docker Containers appeared first on Towards Data Science.

]]>
Unlock the Power of ROC Curves: Intuitive Insights for Better Model Evaluation https://towardsdatascience.com/unlock-the-power-of-roc-curves-intuitive-insights-for-better-model-evaluation/ Tue, 08 Apr 2025 19:38:43 +0000 https://towardsdatascience.com/?p=605689 Go beyond the definitions: grasp the real meaning of AUC and ROC analysis for practical data science

The post Unlock the Power of ROC Curves: Intuitive Insights for Better Model Evaluation appeared first on Towards Data Science.

]]>
We’ve all been in that moment, right? Staring at a chart as if it’s some ancient script, wondering how we’re supposed to make sense of it all. That’s exactly how I felt when I was asked to explain the AUC for the ROC curve at work recently.

Though I had a solid understanding of the math behind it, breaking it down into simple, digestible terms proved to be a challenge. I realized that if I was struggling with it, others probably were too. So, I decided to write this article to share an intuitive way to understand the AUC-ROC curve through a practical example. No dry definitions here—just clear, straightforward explanations focused on the intuition.

Here’s the code1 used in this article.

Every data scientist goes through a phase of evaluating classification models. Amidst an array of evaluation metrics, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are indispensable tools for gauging a model's performance. In this comprehensive article, we will discuss the basic concepts and see them in action using our good old Titanic dataset2.

Section 1: ROC Curve

At its core, the ROC curve visually portrays the delicate balance between a model’s sensitivity and specificity across varying classification thresholds.

To fully grasp the ROC curve, let’s delve into the concepts:

  • Sensitivity/Recall (True Positive Rate): Sensitivity quantifies a model's adeptness at correctly identifying positive instances. In our Titanic example, sensitivity corresponds to the proportion of actual survival cases that the model accurately labels as positive.
  • Specificity (True Negative Rate): Specificity measures a model’s proficiency in correctly identifying negative instances. For our dataset, it represents the proportion of actual non-survived cases (Survival = 0) that the model correctly identifies as negative.
  • False Positive Rate: FPR measures the proportion of negative instances that are incorrectly classified as positive by the model.

Notice that Specificity and FPR are complementary to each other. While specificity focuses on the correct classification of negative instances, FPR focuses on the incorrect classification of negative instances as positive. Thus:

\[\textbf{FPR} = 1 - \textbf{Specificity}\]

Now that we know the definitions, let's work with an example. For the Titanic dataset, I have built a simple logistic regression model that predicts whether the passenger survived the shipwreck or not, using the following features: Passenger Class, Sex, # of siblings/spouses aboard, passenger fare and Port of Embarkation. Note that the model predicts the 'probability of survival'. The default threshold for logistic regression in sklearn is 0.5. However, this default threshold may not always make sense for the problem being solved, and we need to play around with the probability threshold, i.e., if the predicted probability > threshold, the instance is predicted to be positive, else negative.

Now, let's revisit the definitions of Sensitivity, Specificity and FPR above. Since our predicted binary classification is dependent on the probability threshold, for the given model, these three metrics will change based on the probability threshold we use. If we use a higher probability threshold, we will classify fewer cases as positives, i.e., our true positives will be fewer, resulting in lower Sensitivity/Recall. A higher probability threshold also means fewer false positives, and thus a lower FPR. As such, increasing sensitivity/recall typically comes at the cost of an increased FPR.

For our training data, we will use 10 different probability cutoffs, calculate Sensitivity/TPR and FPR, and plot them in the chart below. Note that the size of the circles in the scatterplot corresponds to the probability threshold used for classification.

Chart 1: FPR vs TPR chart along with actual values in the DataFrame (image by author)

Well, that's it. The graph we created above, which plots Sensitivity (TPR) vs. FPR at various probability thresholds, IS the ROC curve!

In our experiment, we used 10 different probability cutoffs with an increment of 0.1 giving us 10 observations. If we use a smaller increment for the probability threshold, we will end up with more data points and the graph will look like our familiar ROC curve.

To confirm our understanding, for the model we built to predict a passenger's survival, we will loop through various predicted probability thresholds and calculate TPR and FPR for the testing dataset (see code snippet below). We then plot the results and compare the graph with the ROC curve plotted using sklearn's roc_curve3 function.
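A minimal sketch of this loop is shown below; it assumes y_test holds the true test labels and y_prob the predicted survival probabilities from the fitted model (both names are mine).

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.asarray(y_test)    # true labels of the test set (assumed to exist)
y_score = np.asarray(y_prob)   # predicted survival probabilities (assumed to exist)

tpr_manual, fpr_manual = [], []
for t in np.arange(0.0, 1.01, 0.01):
    y_pred = (y_score >= t).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    tn = ((y_pred == 0) & (y_true == 0)).sum()
    tpr_manual.append(tp / (tp + fn))
    fpr_manual.append(fp / (fp + tn))

# sklearn's version of the same curve, for comparison
fpr_skl, tpr_skl, _ = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # ~0.92 for the model in the article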

Chart 2: sklearn ROC curve on the left and manually created ROC curve on right (image by author)

As we can see, the two curves are almost identical. Note the AUC=0.92 was calculated using the roc_auc_score4 function. We will discuss this AUC in the later part of this article.

To summarize, the ROC curve plots TPR and FPR for the model at various probability thresholds. Note that the actual probabilities are NOT displayed in the graph, but one can assume that the observations on the lower left side of the curve correspond to higher probability thresholds (low TPR), and observations on the top right side correspond to lower probability thresholds (high TPR).

To visualize what’s stated above, refer to the below chart, where I have tried to annotate TPR and FPR at different probability cutoffs.

Chart 3: ROC Curve with different probability cutoffs (image by author)

Section 2: AUC

Now that we have developed some intuition around what ROC curve is, the next step is to understand Area Under the Curve (AUC). But before delving into the specifics, let’s think about what a perfect classifier looks like. In the ideal case, we want the model to achieve perfect separation between positive and negative observations. In other words, the model assigns low probabilities to negative observations and high probabilities to positive observations with no overlap. Thus, there will exist some probability cut off, such that all observations with predicted probability < cut off are negative, and all observations with probability >= cut off are positive. When this happens, True Positive Rate will be 1 and False Positive Rate will be 0. So the ideal state to achieve is TPR=1 and FPR=0. In reality, this does not happen, and a more practical expectation should be to maximize TPR and minimize FPR.

In general, as TPR increases with lowering probability threshold, the FPR also increases (see chart 1). We want TPR to be much higher than FPR. This is characterized by the ROC curve that is bent towards the top left side. The following ROC space chart shows the perfect classifier with a blue circle (TPR=1 and FPR=0). Models that yield the ROC curve closer to the blue circle are better. Intuitively, it means that the model is able to fairly separate negative and positive observations. Among the ROC curves in the following chart, light blue is best followed by green and orange. The dashed diagonal line represents random guesses (think of a coin flip).

Chart 4: ROC Curve Comparison (source5)

Now that we understand ROC curves skewed to the top left are better, how do we quantify this? Well, mathematically, this can be quantified by calculating the Area Under the Curve. The Area Under the Curve (AUC) of the ROC curve is always between 0 and 1 because our ROC space is bounded between 0 and 1 on both axes. Among the above ROC curves, the model corresponding to the light blue ROC curve is better compared to green and orange as it has higher AUC.

But how is AUC calculated? Computationally, AUC involves integrating the ROC curve. For models generating discrete predictions, AUC can be approximated using the trapezoidal rule6. In its simplest form, the trapezoidal rule works by approximating the region under the graph as a series of trapezoids and summing their areas. I'll probably discuss this in another article.
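As a rough illustration of that idea, reusing the fpr_skl and tpr_skl arrays from the sketch above, the trapezoidal approximation can be written in a couple of lines:

import numpy as np
from sklearn.metrics import auc

# manual trapezoidal rule over the (FPR, TPR) points of the ROC curve
order = np.argsort(fpr_skl)
x, y = np.asarray(fpr_skl)[order], np.asarray(tpr_skl)[order]
auc_manual = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2)

# sklearn.metrics.auc applies the same rule under the hood
print(auc_manual, auc(fpr_skl, tpr_skl))  # both ~0.92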

This brings us to the last and the most awaited part — how to intuitively make sense of AUC? Let’s say you built a first version of a classification model with AUC 0.7 and you later fine tune the model. The revised model has an AUC of 0.9. We understand that the model with higher AUC is better. But what does it really mean? What does it imply about our improved prediction power? Why does it matter? Well, there’s a lot of literature explaining AUC and its interpretation. Some of them are too technical, some incomplete, and some are outright wrong! One interpretation that made the most sense to me is:

AUC is the probability that a randomly chosen positive instance possesses a higher predicted probability than a randomly chosen negative instance.

Let’s verify this interpretation. For the simple logistic regression we built, we will visualize the predicted probabilities of positive and negative classes (i.e. Survived the shipwreck or not).

Chart 5: Predicted Probabilities of Survived and Not Survived Passengers (image by author)

We can see the model performs pretty well in assigning a higher probability to Survived cases than those that did not. There’s some overlap of probabilities in the middle section. The AUC calculated using the auc score function in sklearn for our model on the test dataset is 0.92 (see chart 2). So based on the above interpretation of AUC, if we randomly choose a positive instance and a negative instance, the probability that the positive instance will have a higher predicted probability than the negative instance should be ~92%.

For this purpose, we will create pools of predicted probabilities of positive and negative outcomes. Now we randomly select one observation each from both the pools and compare their predicted probabilities. We repeat this 100K times. Later we calculate % of times the predicted probability of a positive instance was > predicted probability of a negative instance. If our interpretation is correct, this should be equal to AUC.
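A sketch of this experiment, again assuming the y_test and y_prob arrays from the earlier snippet, could look like this:

import numpy as np

rng = np.random.default_rng(42)
y_true = np.asarray(y_test)
y_score = np.asarray(y_prob)

# pools of predicted probabilities for actual positives and actual negatives
pos_probs = y_score[y_true == 1]
neg_probs = y_score[y_true == 0]

# randomly pair one positive with one negative, 100K times
pos_sample = rng.choice(pos_probs, size=100_000)
neg_sample = rng.choice(neg_probs, size=100_000)

print((pos_sample > neg_sample).mean())  # ~0.92, matching the AUC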

We did indeed get 0.92! Hope this helps.

Let me know your comments and feel free to connect with me on LinkedIn.

Note — this article is a revised version of the original article that I wrote on Medium in 2023.


References:

  1. https://github.com/Swpnilsp/ROC-AUC-Curve/blob/main/RoC_Curve_Analysis%20(2).ipynb
  2. https://www.kaggle.com/competitions/titanic/data (License-CC0: Public Domain)
  3. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
  4. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
  5. https://en.wikipedia.org/wiki/Receiver_operating_characteristic
  6. https://en.wikipedia.org/wiki/Trapezoidal_rule

The post Unlock the Power of ROC Curves: Intuitive Insights for Better Model Evaluation appeared first on Towards Data Science.

]]>
Let’s Call a Spade a Spade: RDF and LPG — Cousins Who Should Learn to Live Together https://towardsdatascience.com/lets-call-a-spade-a-spade-rdf-and-lpg-cousins-who-should-learn-to-live-together/ Mon, 07 Apr 2025 23:28:25 +0000 https://towardsdatascience.com/?p=605674 An objective comparison of the RDF and LPG data models 

The post Let’s Call a Spade a Spade: RDF and LPG — Cousins Who Should Learn to Live Together appeared first on Towards Data Science.

]]>
In recent years, there has been a proliferation of articles, LinkedIn posts, and marketing materials presenting graph data models from different perspectives. This article will refrain from discussing specific products and instead focus solely on the comparison of RDF (Resource Description Framework) and LPG (Labelled Property Graph) data models. To clarify, there is no mutually exclusive choice between RDF and LPG — they can be employed in conjunction. The appropriate choice depends on the specific use case, and in some instances both models may be necessary; there is no single data model that is universally applicable. In fact, polyglot persistence and multi-model databases (databases that can support different data models within the database engine or on top of the engine) are gaining popularity as enterprises recognise the importance of storing data in diverse formats to maximise its value and prevent stagnation. For instance, storing time series financial data in a graph model is not the most efficient approach, as it could result in minimal value extraction compared to storing it in a time series matrix database, which enables rapid and multi-dimensional analytical queries.

The purpose of this discussion is to provide a comprehensive comparison of RDF and LPG data models, highlighting their distinct purposes and overlapping usage. While articles often present biased evaluations, promoting their own tools, it is essential to acknowledge that these comparisons are often flawed, as they compare apples to wheelbarrows rather than apples to apples. This subjectivity can leave readers perplexed and uncertain about the author's intended message. In contrast, this article aims to provide an objective analysis, focusing on the strengths and weaknesses of both RDF and LPG data models, rather than acting as promotional material for any tool.

Quick recap of the data models

Both RDF and LPG are descendants of the graph data model, although they possess different structures and characteristics. A graph comprises vertices (nodes) and edges that connect two vertices. Various graph types exist, including undirected graphs, directed graphs, multigraphs, hypergraphs and so on. The RDF and LPG data models adopt the directed multigraph approach, wherein edges have a "from" and "to" ordering, and any pair of vertices can be joined by an arbitrary number of distinct edges. 

The RDF data model is represented by a set of triples reflecting the natural language structure of subject-verb-object, with the subject, predicate, and object represented as such. Consider the following simple example: Jeremy was born in Birkirkara. This sentence can be represented as an RDF statement or fact with the following structure — Jeremy is a subject resource, the predicate (relation) is born in, and the object is the value Birkirkara. The value node could either be a URI (Uniform Resource Identifier) or a datatype value (such as integer or string). If the object is a semantic URI, or, as they are also known, a resource, then the object would lead to other facts, such as Birkirkara townIn Malta. This data model allows for resources to be reused and interlinked in the same RDF-based graph, or in any other RDF graph, internal or external. Once a resource is defined and a URI is "minted", this URI becomes instantly available and can be used in any context that is deemed necessary. 

On the other hand, the LPG data model encapsulates the set of vertices, edges, label assignment functions for vertices and edges, and key-value property assignment function for vertices and edges. For the previous example, the representation would be as follows:


(person:Person {name: "Jeremy"})

(city:City {name: "Birkirkara"})

(person)-[:BORN_IN]->(city)

Consequently, the primary distinction between RDF and LPG lies in how nodes are connected. In the RDF model, relationships are triples where predicates define the connection. In the LPG data model, edges are first-class citizens with their own properties. Therefore, in the RDF data model, predicates are globally defined in a schema and are reused in data graphs, whilst in the LPG data model, each edge is uniquely identified.

Schema vs Schema-less. Do semantics matter at all?

Semantics is a branch of linguistics and logic concerned with meaning, in this case the meaning of data, enabling both humans and machines to interpret the context of the data and any relationships in that context.

Historically, the World Wide Web Consortium (W3C) established the Resource Description Framework (RDF) data model as a standardised framework for data exchange within the Web. RDF facilitates seamless data integration and the merging of diverse sources, while simultaneously supporting schema evolution without necessitating modifications to data consumers. Schemas1, or ontologies, serve as the foundation for data represented in RDF, and through these ontologies the semantic meaning of the data can be defined. This capability makes data integration one of the numerous suitable applications of the RDF data model. Through various W3C groups, standards were established on how schemas and ontologies can be defined, primarily RDF Schema (RDFS), Web Ontology Language (OWL), and recently SHACL. RDFS provides the low-level constructs for defining ontologies, such as the Person entity with properties name, gender, knows, and the expected type of node. OWL provides constructs and mechanisms for formally defining ontologies through axioms and rules, enabling the inference of implicit data. Whilst OWL axioms are taken as part of the knowledge graph and used to infer additional facts, SHACL was introduced as a schema to validate constraints, better known as data shapes (consider it as "what should a Person consist of?") against the knowledge graph. Moreover, through additional features to the SHACL specifications, rules and inference axioms can also be defined using SHACL.

In summary, schemas facilitate the enforcement of the right instance data. This is possible because RDF permits any value to be defined within a fact, provided it adheres to the specifications. Validators, such as in-built SHACL engines or OWL constructs, are responsible for verifying the data's integrity. Given that these validators are standardised, all triple stores (those adhering to the RDF data model) are encouraged to implement them. However, this does not negate the concept of flexibility. The RDF data model is designed to accommodate the growth, extension, and evolution of data within the schema's boundaries. Consequently, while an RDF data model strongly encourages the use of schemas (or ontologies) as its foundation, experts discourage the creation of ivory tower ontologies. This endeavour does require an upfront effort and collaboration with domain experts to construct an ontology that accurately reflects the use case and the data that will be stored in the knowledge graph. Nonetheless, the RDF data model offers the flexibility to create and define RDF-based data independently of a pre-existing ontology, or to develop an ontology iteratively throughout a data project. Furthermore, schemas are designed for reuse, and the RDF data model facilitates this reusability. It is noteworthy that an RDF-based knowledge graph typically encompasses both instance data (such as "Giulia and Matteo are siblings") and ontology/schema axioms (such as "Two people are siblings when they have a parent in common").

Nonetheless, the significance of ontologies extends beyond providing a data structure; they also impart semantic meaning to the data. For instance, in constructing a family tree, an ontology enables the explicit definition of relationships such as aunt, uncle, cousins, niece, nephew, ancestors, and descendants without the need for the explicit data to be defined in the knowledge graph. Consider how this concept can be applied in various pharmaceutical scenarios, just to mention one vertical domain. Reasoning is a fundamental component that renders the RDF data model a semantically powerful model for designing knowledge graphs. Ontologies provide a particular data point with all the necessary context, including its neighbourhood and its meaning. For instance, if there is a literal node with the value 37, an RDF-based agent can comprehend that the value 37 represents the age of a person named Jeremy, who is the nephew of a person named Peter.

In contrast, the LPG data model offers a more agile and straightforward deployment of graph data. LPGs place less emphasis on schemas: they support only some constraints and "labels"/classes. Graph databases adhering to the LPG data model are known for the speed with which data can be prepared for consumption, precisely because of this schema-less nature, which makes them an attractive choice for data architects who want to deploy their data quickly. The LPG data model is particularly advantageous in scenarios where the data is not expected to grow or change significantly; a modification to a property, for instance, would require refactoring the graph to update nodes with the newly added or updated key-value property. While LPG gives the illusion of providing semantics through node and edge labels and corresponding functions, it does not inherently do so: LPG functions simply return a map of values associated with a node or edge. Nonetheless, this directness is fundamental for use cases that need to run fast graph algorithms, as the data is available directly in the nodes and edges and there is no need for further graph traversal.

However, one fundamental feature of the LPG data model is the ease and flexibility with which granular attributes or properties can be attached to either vertices or edges. For instance, if there are two person nodes, "Alice" and "Bob," with an edge labelled "marriedTo," the LPG data model can easily and accurately state that Alice and Bob were married on February 29, 2024. The RDF data model can achieve this through various workarounds, such as reification, but these result in more complex queries than the LPG equivalent.
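
For readers who want to see what that workaround looks like in practice, here is a hedged sketch of classic RDF reification with rdflib; the ex: namespace and property names are illustrative, and newer approaches such as RDF-star (discussed below) make this considerably less verbose.

# Reifying the (Alice, marriedTo, Bob) statement so a marriage date can be attached.
from rdflib import Graph, Namespace, BNode, Literal, RDF, XSD

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

# The plain fact
g.add((EX.Alice, EX.marriedTo, EX.Bob))

# The same fact as a reified statement we can describe further
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Alice))
g.add((stmt, RDF.predicate, EX.marriedTo))
g.add((stmt, RDF.object, EX.Bob))
g.add((stmt, EX.marriageDate, Literal("2024-02-29", datatype=XSD.date)))

print(g.serialize(format="turtle"))

Every query that needs the marriage date must now join through the reified statement, which is exactly the extra complexity the paragraph above refers to.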

Standards, Standardisation Bodies, Interoperability

In the previous section we described how the W3C provides standardisation groups pertaining to the RDF data model. For instance, a W3C working group is actively developing the RDF-star (RDF*) standard, which incorporates the complex-relationship concept (attaching attributes to facts/triples) into the RDF data model. This standard is anticipated to be adopted and supported by all triple store tools and agents based on the RDF data model. However, the process of standardisation can be protracted, frequently resulting in delays that leave such vendors at a disadvantage.

Nonetheless, standards facilitate much-needed interoperability. Knowledge graphs built upon the RDF data model can easily be ported between different applications and triple stores, as there is no vendor lock-in and standardised serialisation formats are provided. Similarly, they can be queried with one standard query language, SPARQL, which is used across the different vendors. Whilst the query language is the same, vendors opt for different query execution plans, much as any database engine (SQL or NoSQL) does, to enhance performance and speed.

Most LPG graph implementations, although open source, use proprietary or custom languages for storing and querying data, without adherence to a standard. This practice decreases the interoperability and portability of data between different vendors. In recent months, however, ISO approved and published ISO/IEC 39075:2024, which standardises the Graph Query Language (GQL) based on Cypher. As the charter rightly points out, the graph data model has unique advantages over relational databases, such as fitting data that has hierarchical, complex, or arbitrary structures. Nevertheless, the proliferation of vendor-specific implementations overlooks a crucial piece of functionality: a standardised approach to querying property graphs. It is therefore paramount that property graph vendors align their products with this standard.

Recently, OneGraph2 was proposed as an interoperable metamodel intended to overcome the choice between the RDF data model and the LPG data model. Furthermore, extensions to openCypher have been proposed3 as a way of querying over RDF data. This vision aims to pave the way for having data in both RDF and LPG combined in a single, integrated database, ensuring the benefits of both data models.

Other notable differences

Notable differences, mostly in the query languages, exist to support the respective data models. However, we strongly argue against letting a set of query language features dictate which data model to use. Nonetheless, we discuss some of these differences here for a more complete overview.

The RDF data model offers a natural way of supporting globally unique resource identifiers (URIs), which manifest in three distinct characteristics. Within the RDF domain, a set of facts described by RDF statements (i.e. s, p, o) sharing the same subject URI is referred to as a resource. Data stored in RDF graphs can conveniently be split into multiple named graphs, ensuring that each graph encapsulates distinct concerns. For instance, using the RDF data model it is straightforward to construct graphs that store data or resources, metadata, audit, and provenance data separately, whilst interlinking and querying can be carried out seamlessly across these multiple graphs. Furthermore, graphs can establish interlinks with resources located in graphs hosted on different servers; querying these external resources is facilitated through query federation in the SPARQL protocol. Given the adoption of URIs, RDF embodies the original vision of Linked Data4, a vision that has since been adopted, to an extent, as a guiding principle in the FAIR principles5, Data Fabric, Data Mesh, and HATEOAS, amongst others. Consequently, the RDF data model serves as a versatile framework that can seamlessly integrate with these visions without the need for any modifications.
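
The named-graph idea is easy to sketch in code. The following is a small, illustrative example (my own, with invented graph URIs) that uses rdflib's Dataset to keep instance data and provenance in separate named graphs and then query across both; a triple store exposes the same GRAPH mechanism through SPARQL.

# Splitting data and provenance into named graphs, then querying across them.
from rdflib import Dataset, Namespace, Literal, URIRef

EX = Namespace("http://example.org/")
ds = Dataset()

data_graph = ds.graph(URIRef("http://example.org/graph/data"))
prov_graph = ds.graph(URIRef("http://example.org/graph/provenance"))

data_graph.add((EX.Giulia, EX.siblingOf, EX.Matteo))
prov_graph.add((EX.Giulia, EX.importedFrom, Literal("civil-registry-export.csv")))

results = ds.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person ?source WHERE {
      GRAPH <http://example.org/graph/data>       { ?person ex:siblingOf ?sibling . }
      GRAPH <http://example.org/graph/provenance> { ?person ex:importedFrom ?source . }
    }
""")
for row in results:
    print(row.person, row.source)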

LPGs, on the other hand, are better geared towards path traversal queries, graph analytics, and variable-length path queries. Whilst these functionalities can be considered specific implementations of the query language, they are pertinent considerations when modelling data in a graph, since they are also benefits over traditional relational databases. SPARQL, through the W3C recommendation, has limited support for path traversal6, and some vendor triple store implementations do support variable-length paths7 (although not as part of the SPARQL 1.1 recommendation). At the time of writing, the SPARQL 1.2 recommendation will not incorporate this feature either.

Data Graph Patterns

The following section describes various data graph patterns and how they would fit, or not, both data models discussed in this article.

| Pattern | RDF data model | LPG data model |
| --- | --- | --- |
| Global Definition of Relations/Properties | Through schemas, properties are globally defined with semantic properties such as domains and ranges, algebraic properties such as inverse-of, reflexive, and transitive, and informative annotations on property definitions. | Semantics of relations (edges) is not supported in property graphs. |
| Multiple Languages | String data can have a language tag attached to it, which is considered when processing. | Can be a custom field or relationship (e.g. label_en, label_mt), but with no special treatment. |
| Taxonomy/Hierarchy | Automatic inferencing and reasoning; can handle complex classes. | Can model hierarchies, but not hierarchies of classes of individuals; would require explicit traversal of classification hierarchies. |
| Individual Relationships | Requires workarounds like reification and complex queries. | Can make direct assertions over them; natural representation and efficient querying. |
| Property Inheritance | Properties are inherited through defined class hierarchies; furthermore, the RDF data model can represent subproperties. | Must be handled in application logic. |
| N-ary Relations | Relationships are generally binary (triples), but N-ary relations can be modelled via blank nodes, additional resources, or reification. | Can often be translated to additional attributes on edges. |
| Property Constraints and Validation | Available through schema definitions: RDFS, OWL, or SHACL. | Supports minimal constraints such as value uniqueness, but generally requires validation through schema layers or application logic. |
| Context and Provenance | Can be handled in various ways, including a separate named graph linked to the main resources, or through reification. | Can add properties to nodes and edges to capture context and provenance. |
| Inferencing | Automates the inferencing of inverse relationships, transitive patterns, complex property chains, disjointness, and negation. | Either requires explicit definition in application logic, or has no support at all (disjointness and negation). |

Semantics in Graphs — A Family Tree Example

A comprehensive exploration of applying the RDF data model and semantics within an LPG application can be found in various articles published on Medium, LinkedIn, and other blogs. As outlined in the previous section, the LPG data model is not designed for reasoning. Reasoning means applying logical rules to existing facts in order to deduce new knowledge; this is important because it helps uncover hidden relationships that were not explicitly stated before.

In this section we will demonstrate how axioms are defined for a simple yet practical example: a family tree. A family tree is an ideal candidate for any graph database due to its hierarchical structure and the flexibility with which it can be defined in either data model. For this demonstration, we will model the Pewterschmidt family, a fictional family from the popular animated television series Family Guy.

All images, unless otherwise noted, are by the author.

In this case, we are creating just one relationship, called 'hasChild'. So Carter has a child named Lois, and so on. The only other attribute we are adding is the gender (Male/Female). For the RDF data model, we have created a simple OWL ontology:

(Figure: the simple OWL ontology created for the family tree.)

The current schema enables us to represent the family tree in the RDF data model. With ontologies, we can then define properties whose values can be deduced from the initial data. We introduce the following properties:

| Property | Comment | Axiom | Example |
| --- | --- | --- | --- |
| isAncestorOf | A transitive property which is also the inverse of the isDescendentOf property. OWL engines automatically infer transitive properties without the need for rules. | hasChild(?x, ?y) -> isAncestorOf(?x, ?y) | Carter isAncestorOf Lois, Lois isAncestorOf Chris, hence Carter isAncestorOf Chris |
| isDescendentOf | A transitive property, the inverse of isAncestorOf. OWL engines automatically infer inverse properties without the need for rules. | | Chris isDescendentOf Peter |
| isBrotherOf | A subproperty of isSiblingOf and disjoint with isSisterOf, meaning that the same person cannot be both the brother and the sister of another person, and cannot be their own brother. | hasChild(?x, ?y), hasChild(?x, ?z), hasGender(?y, Male), notEqual(?y, ?z) -> isBrotherOf(?y, ?z) | Chris isBrotherOf Meg |
| isSisterOf | A subproperty of isSiblingOf and disjoint with isBrotherOf, meaning that the same person cannot be both the sister and the brother of another person, and cannot be their own sister. | hasChild(?x, ?y), hasChild(?x, ?z), hasGender(?y, Female), notEqual(?y, ?z) -> isSisterOf(?y, ?z) | Meg isSisterOf Chris |
| isSiblingOf | A super-property of isBrotherOf and isSisterOf. OWL engines automatically infer super-properties. | | Chris isSiblingOf Meg |
| isNephewOf | A property that infers who is the nephew of a person's siblings (the aunts and uncles), based on the child's gender. | isSiblingOf(?x, ?y), hasChild(?x, ?z), hasGender(?z, Male), notEqual(?y, ?x) -> isNephewOf(?z, ?y) | Stewie isNephewOf Carol |
| isNieceOf | A property that infers who is the niece of a person's siblings (the aunts and uncles), based on the child's gender. | isSiblingOf(?x, ?y), hasChild(?x, ?z), hasGender(?z, Female), notEqual(?y, ?x) -> isNieceOf(?z, ?y) | Meg isNieceOf Carol |

These axioms are imported into a triple store, whose engine applies them to the explicit facts in real time. Through these axioms, triple stores allow the querying of inferred (hidden) triples. Therefore, if we want to get the explicit information about Chris Griffin, the following query can be executed (note that the EXPLICIT keyword used here is a vendor-specific extension rather than standard SPARQL):

SELECT ?p ?o WHERE {
 <http://example.org/ChrisGriffin> ?p ?o EXPLICIT true
}

If we need to get the inferred values for Chris, the SPARQL engine will provide us with 10 inferred facts:

SELECT ?p ?o WHERE {
 <http://example.org/ChrisGriffin> ?p ?o EXPLICIT false
}

This query will return all implicit facts for Chris Griffin. The image below shows the discovered facts. These are not explicitly stored in the triple store.

These results could not be produced by the property graph store, as no reasoning could be applied automatically. 

The RDF data model empowers users to discover previously unknown facts, a capability that the LPG data model lacks. LPG implementations can work around this limitation by developing complex stored procedures; however, unlike in RDF, these stored procedures vary (where they are possible at all) across different vendor implementations, rendering them non-portable and impractical.
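
For readers who want to experiment with this kind of inference locally, here is a hedged sketch using rdflib together with the owlrl reasoner as stand-ins for a triple store's built-in engine (the queries above ran against such an engine). It covers only the isAncestorOf/isDescendentOf axioms, here modelled with rdfs:subPropertyOf, owl:TransitiveProperty, and owl:inverseOf, which have the same effect as the hasChild -> isAncestorOf rule in the table; the URIs are illustrative.

# Materialising inferred family-tree facts with rdflib + owlrl.
from rdflib import Graph, Namespace, RDF, RDFS, OWL
import owlrl

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

# Schema axioms: hasChild implies isAncestorOf, which is transitive
# and the inverse of isDescendentOf.
g.add((EX.hasChild, RDFS.subPropertyOf, EX.isAncestorOf))
g.add((EX.isAncestorOf, RDF.type, OWL.TransitiveProperty))
g.add((EX.isDescendentOf, OWL.inverseOf, EX.isAncestorOf))

# Explicit instance data.
g.add((EX.Carter, EX.hasChild, EX.Lois))
g.add((EX.Lois, EX.hasChild, EX.Chris))

# Expand the graph with everything OWL-RL can infer.
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

print((EX.Carter, EX.isAncestorOf, EX.Chris) in g)   # True, inferred
print((EX.Chris, EX.isDescendentOf, EX.Lois) in g)   # True, inferred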

Take-home message

In this article, the RDF and LPG data models have been presented objectively. On the one hand, the LPG data model offers rapid deployment of graph databases without the need for an advanced schema to be defined (i.e. it is schema-less). Conversely, the RDF data model requires a more time-consuming bootstrapping process for graph data, or a knowledge graph, due to its schema definition requirement. However, the decision to adopt one model over the other should consider whether the additional effort is justified by the meaningful context it gives the data. This consideration is influenced by the specific use case. For instance, in social networks where neighbourhood exploration is the primary requirement, the LPG data model may be more suitable. On the other hand, for more advanced knowledge graphs that necessitate reasoning or data integration across multiple sources, the RDF data model is the preferred choice.

It is crucial to avoid letting personal preferences for query languages dictate the choice of data model. Regrettably, many available articles serve primarily as marketing tools rather than educational resources, hindering adoption and creating confusion within the graph database community. Furthermore, in an era of abundant and accessible information, vendors should refrain from promoting misinformation about opposing data models. A common misconception promoted by property graph evangelists is that the RDF data model is overly complex and academic and can therefore be dismissed. This assertion is based on preferential prejudice. RDF is both a machine- and human-readable data model that is close to business language, especially through the definition of schemas and ontologies. Moreover, adoption of the RDF data model is widespread; for instance, Google uses it as the standard for representing meta-information about web pages through schema.org. There is also the assumption that the RDF data model works exclusively with a schema. This too is a misconception: data defined using the RDF data model can also be schema-less. However, it is acknowledged that all semantics would then be lost, and the data would be reduced to plain graph data. This article has also mentioned how the OneGraph vision aims to establish a bridge between the two data models.

To conclude, technical feasibility alone should not drive the decision of which graph data model to select. Reducing higher-level abstractions to primitive constructs often increases complexity and can impede solving specific use cases effectively. Decisions should be guided by use case requirements and performance considerations rather than merely by what is technically possible.


The author would like to thank Matteo Casu for his input and review. This article is dedicated to Norm Friend, whose untimely demise left a void in the Knowledge Graph community.


1 Schemas and ontologies are used interchangeably in this article.
2 Lassila, O. et al. The OneGraph Vision: Challenges of Breaking the Graph Model Lock—In. https://www.semantic-web-journal.net/system/files/swj3273.pdf.
3 Broekema, W. et al. openCypher Queries over Combined RDF and LPG Data in Amazon Neptune. https://ceur-ws.org/Vol-3828/paper44.pdf.
4 https://www.w3.org/DesignIssues/LinkedData.html
5 https://www.go-fair.org/fair-principles

The post Let’s Call a Spade a Spade: RDF and LPG — Cousins Who Should Learn to Live Together appeared first on Towards Data Science.

]]>
Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data https://towardsdatascience.com/are-we-watching-more-ads-than-content-analyzing-youtube-sponsor-data/ Fri, 04 Apr 2025 00:16:48 +0000 https://towardsdatascience.com/?p=605408 Exploring if sponsor segments are getting longer by the year

The post Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data appeared first on Towards Data Science.

]]>
I’m definitely not the only person who feels that YouTube sponsor segments have become longer and more frequent recently. Sometimes, I watch videos that seem to be trying to sell me something every couple of seconds.

On one hand, it’s great that both small and medium-sized YouTubers are able to make a living from their craft, but on the other hand, it sure is annoying to be bombarded by ads. 

In this blog post, I will explore these sponsor segments, using data from a popular browser extension called SponsorBlock, to figure out if the perceived increase in ads actually did happen and also to quantify how many ads I’m watching.

I will walk you through my analysis, providing code snippets in SQL, DuckDB, and pandas. All the code is available on my GitHub, and since the dataset is open, I will also teach you how to download it, so that you can follow along and play with the data yourself.

These are the questions I will be trying to answer in this analysis:

  • Have sponsor segments increased over the years?
  • Which channels have the highest percentage of sponsor time per video?
  • What is the density of sponsor segments throughout a video?

To get to these answers, we will have to cover much ground: how SponsorBlock works, how to clean and explore the data, and then the analysis of each question in turn.

Let’s get this started!

How SponsorBlock Works

SponsorBlock is an extension that allows you to skip ad segments in videos, similar to how you skip Netflix intros. It’s incredibly accurate, as I don’t remember seeing one wrong segment since I started using it around a month ago, and I watch a lot of smaller non-English creators.

You might be asking yourself how the extension knows which parts of the video are sponsors, and, believe it or not, the answer is through crowdsourcing!

Users submit the timestamps for the ad segments, and other users vote if it’s accurate or not. For the average user, who isn’t contributing at all, the only thing you have to do is to press Enter to skip the ad.

Okay, now that you know what SponsorBlock is, let’s talk about the data. 

Cleaning the Data

If you want to follow along, you can download a copy of the data using this SponsorBlock Mirror (it might take you quite a few minutes to download it all). The database schema can be seen here, although most of it won’t be useful for this project.

As one might expect, their database schema is made for the extension to work properly, and not for some guy to basically leech from a huge community effort to find what percentage of ads his favorite creator runs. For this, some work will need to be done to clean and model the data.

The only two tables that are important for this analysis are:

  • sponsorTimes.csv : This is the most important table, containing the startTime and endTime of all crowdsourced sponsor segments. The CSV is around 5GB.
  • videoInfo.csv : Contains the video title, publication date, and channel ID associated with each video.

Before we get into it, these are all the libraries I ended up using. I will explain the less obvious ones as we go.

pandas
duckdb
requests
requests-cache
python-dotenv
seaborn
matplotlib
numpy

The first step, then, is to load the data. Surprisingly, this was already a bit challenging, as I was getting a lot of errors parsing some rows of the CSV. These were the settings I found to work for the majority of the rows:

import duckdb
import os

# Connect to an in-memory DuckDB instance
con = duckdb.connect(database=':memory:')

sponsor_times = con.read_csv(
    "sb-mirror/sponsorTimes.csv",
    header=True,
    columns={
        "videoID": "VARCHAR",
        "startTime": "DOUBLE",
        "endTime": "DOUBLE",
        "votes": "INTEGER",
        "locked": "INTEGER",
        "incorrectVotes": "INTEGER",
        "UUID": "VARCHAR",
        "userID": "VARCHAR",
        "timeSubmitted": "DOUBLE",
        "views": "INTEGER",
        "category": "VARCHAR",
        "actionType": "VARCHAR",
        "service": "VARCHAR",
        "videoDuration": "DOUBLE",
        "hidden": "INTEGER",
        "reputation": "DOUBLE",
        "shadowHidden": "INTEGER",
        "hashedVideoID": "VARCHAR",
        "userAgent": "VARCHAR",
        "description": "VARCHAR",
    },
    ignore_errors=True,
    quotechar="",
)

video_info = con.read_csv(
    "sb-mirror/videoInfo.csv",
    header=True,
    columns={
        "videoID": "VARCHAR",
        "channelID": "VARCHAR",
        "title": "VARCHAR",
        "published": "DOUBLE",
    },
    ignore_errors=True,
    quotechar=None,
)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Here is what a sample of the data looks like:

con.sql("SELECT videoID, startTime, endTime, votes, locked, category FROM sponsor_times LIMIT 5")

con.sql("SELECT * FROM video_info LIMIT 5")
Sample of sponsorTimes.csv
Sample of videoInfo.csv

Understanding the data in the sponsorTimes table is ridiculously important; otherwise, the cleaning process won't make any sense.

Each row represents a user-submitted timestamp for a sponsored segment. Since multiple users can submit segments for the same video, the dataset contains duplicate and potentially incorrect entries, which will need to be dealt with during cleaning.

To find incorrect segments, I will use the votes and locked columns, as the latter marks segments that were confirmed to be correct. 

Another important column is the category. There are a bunch of categories like Intro, Outro, Filler, etc. For this analysis, I will only work with Sponsor and Self-Promo.

I started by applying some filters:

CREATE TABLE filtered AS
SELECT
    *
FROM sponsor_times
WHERE category IN ('sponsor', 'selfpromo') AND (votes > 0 OR locked=1)

Filtering for locked segments or segments with more than 0 votes was a big decision. It reduced the dataset by a huge percentage, but doing so made the data very reliable. For example, before doing this, all of the Top 50 channels with the highest percentage of ads were just spam: random channels whose videos were 99.9% ads.

With this done, the next step is to get a dataset where each sponsor segment shows up only once. For example, a video with a sponsor segment at the beginning and another at the end should have only two rows of data.

This is very much not the case so far, since in one video we can have multiple user-submitted entries for each segment. To do this, I will use window functions to identify if two or more rows of data represent the same segment. 

The first window function compares the startTime of one row with the endTime of the previous. If these values don’t overlap, it means they are entries for separate segments, otherwise they are repeated entries for the same segment. 

CREATE TABLE new_segments AS
SELECT
    -- Coalesce to TRUE to deal with the first row of every window
    -- as the values are NULL, but it should count as a new segment.
    COALESCE(startTime > LAG(endTime) 
      OVER (PARTITION BY videoID ORDER BY startTime), true) 
      AS new_ad_segment,
    *
FROM filtered
Window Function example for a single video.

The new_ad_segment column is TRUE every time a row represents a new segment of a video. The first two rows, as their timestamps overlap, are properly marked as the same segment.

Next up, the second window function will label each ad segment by number:

CREATE TABLE ad_segments AS
SELECT
    SUM(new_ad_segment) 
      OVER (PARTITION BY videoID ORDER BY startTime)
      AS ad_segment,
    *
FROM new_segments
Example of labels for ad segments for a single video.

Finally, now that each segment is properly numbered, it’s easy to get the segment that is either locked or has the highest amount of votes.

CREATE TABLE unique_segments AS
SELECT DISTINCT ON (videoID, ad_segment)
    *
FROM ad_segments
ORDER BY videoID, ad_segment, locked DESC, votes DESC
Example of what the final dataset looks like for a single video.

That’s it! Now this table has one row for each unique ad segment, and I can start exploring the data.

If these queries feel complicated, and you need a refresher on window functions, check out this blog post that will teach you all you need to know about them! The last example covered in the blog post is almost exactly the process I used here.

Exploring and Enhancing the Data

Finally, the dataset is good enough to start exploring. The first thing I did was to get a sense of the size of the data (a query sketch for these numbers follows the list):

  • 36.0k Unique Channels
  • 552.6k Unique Videos
  • 673.8k Unique Sponsor Segments, for an average of 1.22 segments per video
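
For reference, a summary query along these lines produces those numbers. This is my own sketch rather than code from the original post, and it assumes the unique_segments and video_info tables built above.

summary = con.sql("""
    SELECT
        COUNT(DISTINCT vi.channelID)                 AS unique_channels,
        COUNT(DISTINCT us.videoID)                   AS unique_videos,
        COUNT(*)                                     AS unique_segments,
        COUNT(*) * 1.0 / COUNT(DISTINCT us.videoID)  AS avg_segments_per_video
    FROM unique_segments us
    LEFT JOIN video_info vi ON us.videoID = vi.videoID
""")
print(summary)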

As mentioned earlier, filtering for segments that were either locked or had at least 1 upvote reduced the dataset massively, by around 80%. But this is the price I had to pay to have data that I could work with.

To check that nothing was immediately wrong with the data, I gathered the channels with the most videos:

CREATE TABLE top_5_channels AS 
SELECT
    channelID,
    count(DISTINCT unique_segments.videoID) AS video_count
FROM
    unique_segments
    LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID 
WHERE
    channelID IS NOT NULL
    -- Some channel IDs are blank
    AND channelID != '""'
GROUP BY
    channelID
ORDER BY
    video_count DESC
LIMIT 5

The number of videos per channel looks realistic… But this is terrible to work with. I don't want to go to my browser and look up channel IDs every time I want to know the name of a channel.

To fix this, I created a small script with functions to get these values from the YouTube API in Python. I’m using the library requests_cache to make sure I won’t be repeating API calls and depleting the API limits.

import requests
import requests_cache
from dotenv import load_dotenv
import os

load_dotenv()
API_KEY = os.getenv("YT_API_KEY")

# Cache responses indefinitely
requests_cache.install_cache("youtube_cache", expire_after=None)

def get_channel_name(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    response = requests.get(url)
    data = response.json()

    try:
        return data.get("items", [])[0].get("snippet", {}).get("title", "")
    except (IndexError, AttributeError):
        return ""

Besides this, I also created very similar functions to get the country and thumbnail of each channel, which will be useful later. If you’re interested in the code, check the GitHub repo.
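
For completeness, here is a hedged sketch of what those helpers might look like (my reconstruction, not necessarily the repo's exact code), reusing requests, requests_cache, and API_KEY from the snippet above. The snippet.country and snippet.thumbnails fields come from the YouTube Data API v3 channels resource.

def get_channel_country(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    data = requests.get(url).json()
    try:
        # "country" is optional; channels that don't set it return ""
        return data.get("items", [])[0].get("snippet", {}).get("country", "")
    except (IndexError, AttributeError):
        return ""

def get_channel_thumbnail(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    data = requests.get(url).json()
    try:
        thumbs = data.get("items", [])[0].get("snippet", {}).get("thumbnails", {})
        return thumbs.get("default", {}).get("url", "")
    except (IndexError, AttributeError):
        return ""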

In my DuckDB code, I'm now able to register these Python functions and call them within SQL! I just need to be very careful to always use them on aggregated and filtered data; otherwise, I can say bye-bye to my API quota.

# This the script created above
from youtube_api import get_channel_name

# Try registering the function, ignore if already exists
try:
    con.create_function('get_channel_name', get_channel_name, [str], str)
except Exception as e:
    print(f"Skipping function registration (possibly already exists): {e}")

# Get the channel names
channel_names = con.sql("""
    select
        channelID,
        get_channel_name(channelID) as channel_name,
        video_count
    from top_5_channels
""")

Much better! I looked up two channels that I’m familiar with on YouTube for a quick sanity check. Linus Tech Tips has a total of 7.2k videos uploaded, with 2.3k present in this dataset. Gamers Nexus has 3k videos, with 700 in the dataset. Looks good enough for me!

The last thing to do, before moving on to the question I actually set out to answer, is to get an idea of the average duration of videos. 

This matches my expectations, for the most part. I'm still a bit surprised by the number of 20–40-minute videos, as for many years the "meta" was to make videos around 10 minutes long to maximize YouTube's own ads. 

Also, I thought those buckets of video durations used in the previous graph were quite representative of how I think about video lengths, so I will be sticking with them for the next sections.

For reference, this is the pandas code used to create those buckets.

video_lengths = con.sql("""
  SELECT DISTINCT ON (videoID)
      videoID,
      videoDuration
  FROM
      unique_segments
  WHERE
      videoID IS NOT NULL
      AND videoDuration > 0
"""
).df()

# Define custom bins, in minutes
bins = [0, 3, 7, 12, 20, 40, 90, 180, 600, 9999999] 
labels = ["0-3", "3-7", "7-12", "12-20", "20-40", "40-90", "90-180", "180-600", "600+"]

# Assign each video to a bucket (transform duration to minutes)
video_lengths["duration_bucket"] = pd.cut(video_lengths["videoDuration"] / 60, bins=bins, labels=labels, right=False)

Have Sponsor Segments Increased Over the Years?

The big question. This will prove if I’m being paranoid or not about everyone trying to sell me something at all times. I will start, though, by answering a simpler question, which is the percentage of sponsors for different video durations.

My expectation is that shorter videos have a higher share of their runtime from sponsors in comparison to longer videos. Let’s check if this is actually the case.

CREATE TABLE video_total_ads AS
SELECT
    videoID,
    MAX(videoDuration) AS videoDuration,
    SUM(endTime - startTime) AS total_ad_duration,
    SUM(endTime - startTime) / 60 AS ad_minutes,
    SUM(endTime - startTime) / MAX(videoDuration) AS ad_percentage,
    MAX(videoDuration) / 60 AS video_duration_minutes
FROM
    unique_segments
WHERE
    videoDuration > 0
    AND videoDuration < 5400
    AND videoID IS NOT NULL
GROUP BY
    videoID

To keep the visualization simple, I’m applying similar buckets, but only up to 90 minutes.

# Define duration buckets (in minutes, up to 90min)
bins = [0, 3, 7, 12, 20, 30, 40, 60, 90]    
labels = ["0-3", "3-7", "7-12", "12-20", "20-30", "30-40", "40-60", "60-90"]

video_total_ads = video_total_ads.df()

# Apply the buckets again
video_total_ads["duration_bucket"] = pd.cut(video_total_ads["videoDuration"] / 60, bins=bins, labels=labels, right=False)

# Group by bucket and sum ad times and total durations
bucket_data = video_total_ads.groupby("duration_bucket")[["ad_minutes", "videoDuration"]].sum()

# Convert to percentage of total video time
bucket_data["ad_percentage"] = (bucket_data["ad_minutes"] / (bucket_data["videoDuration"] / 60)) * 100
bucket_data["video_percentage"] = 100 - bucket_data["ad_percentage"]

As expected, if you’re watching shorter-form content on YouTube, then around 10% of it is sponsored! Videos of 12–20 min in duration have 6.5% of sponsors, while 20–30 min have only 4.8%.

To move forward to the year-by-year analysis I need to join the sponsor times with the videoInfo table.

CREATE TABLE video_total_ads_joined AS
SELECT
    *
FROM
    video_total_ads
LEFT JOIN video_info ON video_total_ads.videoID = video_info.videoID

Next, let’s just check how many videos we have per year:

SELECT
    *,
    to_timestamp(NULLIF (published, 0)) AS published_date,
    extract(year FROM to_timestamp(NULLIF (published, 0))) AS published_year
FROM
    video_total_ads_joined -- the joined table is the one that has the published column

Not good, not good at all. I'm not exactly sure why, but a lot of videos didn't have their publication timestamp recorded. It seems that only in 2021 and 2022 were videos reliably stored with their published date.

I do have some ideas on how I can improve this dataset with other public data, but it’s a very time-consuming process and I will leave this for a future blog post. I don’t intend to settle for an answer based on limited data, but for now, I will have to make do with what I have.

I chose to keep the analysis between the years 2018 and 2023, given that those years had more data points.

# Limiting the years as for these here I have a decent amount of data.
start_year = 2018
end_year = 2023

plot_df = (
    video_total_ads_joined.df()
    .query(f"published_year >= {start_year} and published_year <= {end_year}")
    .groupby(["published_year", "duration_bucket"], as_index=False)
    [["ad_minutes", "video_duration_minutes"]]
    .sum()
)

# Calculate ad_percentage & content_percentage
plot_df["ad_percentage"] = (
    plot_df["ad_minutes"] / plot_df["video_duration_minutes"] * 100
)
plot_df["content_percentage"] = 100 - plot_df["ad_percentage"]

There is a steep increase in ad percentage, especially from 2020 to 2021, but afterward it plateaus, especially for longer videos. This makes a lot of sense, since online advertising grew a lot during those years as people spent more and more time at home. 

For shorter videos, there does seem to be an increase from 2022 to 2023. But as the data is limited, and I don’t have data for 2024, I can’t get a conclusive answer to this. 

Next up, let’s move into questions that don’t depend on the publishing date, this way I can work with a larger portion of the dataset.

Which Channels Have the Highest Percentage of Sponsor Time Per Video?

This is a fun one for me, as I wonder if the channels I actively watch are the ones that run the most ads. 

Continuing from the table created previously, I can easily group the ad and video amount by channel:

CREATE TABLE ad_percentage_per_channel AS
SELECT
    channelID,
    sum(ad_minutes) AS channel_total_ad_minutes,
    sum(videoDuration) / 60 AS channel_total_video_minutes,
    -- Share of each channel's total runtime that is sponsored
    sum(ad_minutes) / (sum(videoDuration) / 60) * 100 AS channel_ad_percentage
FROM
    video_total_ads_joined
GROUP BY
    channelID

I decided to filter for channels that had at least 30 hours (1,800 minutes) of video in the data, as a way of eliminating outliers.

SELECT
    channelID,
    channel_total_video_minutes,
    channel_total_ad_minutes,
    channel_ad_percentage
FROM
    ad_percentage_per_channel
WHERE
    -- At least 1,800 total minutes (30 hours) of video
    channel_total_video_minutes > 1800
    AND channelID IS NOT NULL
ORDER BY
    channel_ad_percentage DESC
LIMIT 50

As briefly mentioned earlier, I also created some functions to get the country and thumbnail of each channel. This allowed me to create this visualization.

I’m not sure if this surprised me or not. Some of the channels on this list I watch very frequently, especially Gaveta (#31), a Brazilian YouTuber who covers movies and film editing.

I also know that both he and Corridor Crew (#32) do a lot of self-sponsorship, promoting their own content and products, so maybe this is also the case for other channels! 

In any case, the data seems good, and the percentages seem to match my manual checks and personal experience.

I would love to know if channels that you watch were present in this list, and if it surprised you or not!

If you want to see the Top 150 Creators, subscribe to my free newsletter, as I will be publishing the full list as well as more information about this analysis in there!

What Is the Density of Sponsor Segments Throughout a Video?

Have you ever thought about at which point of a video ads work best? People probably just skip sponsor segments placed at the beginning, and simply close the video when they reach the ones placed at the end.

From personal experience, I feel that I’m more likely to watch an ad if it plays around the middle of a video, but I don’t think this is what creators do in most cases.

My goal, then, is to create a heatmap that shows the density of ads during a video runtime. Doing this was surprisingly not obvious, and the solution that I found was so clever that it kinda blew my mind. Let me show you.

This is the data needed for this analysis. One row per ad, with the timestamp when each segment starts and ends:

The first step is to normalize the intervals: I don't care that an ad started at 63 seconds; what I want to know is whether it started at 1% or at 50% of the video's runtime.

CREATE TABLE ad_intervals AS
SELECT
    videoID,
    startTime,
    endTime,
    videoDuration,
    startTime / videoDuration AS start_fraction,
    endTime / videoDuration AS end_fraction
FROM
    unique_segments
WHERE
    -- Just to make sure we don't have bad data
    videoID IS NOT NULL
    AND startTime >= 0
    AND endTime <= videoDuration
    AND startTime < endTime
    -- Less than 40h
    AND videoDuration < 144000

Great, now all intervals are comparable, but the problem is far from solved.

I want you to stop and think: how would you solve this if I asked you, "At 10% of the runtime, across all videos, how many ads are running?"

I do not believe that this is an obvious problem to solve. My first instinct was to create a bunch of buckets, and then, for each row, I would ask “Is there an ad running at 1% of the runtime? What about at 2%? And so on…”

This seemed like a terrible idea, though. I wouldn’t be able to do it in SQL, and the code to solve it would be incredibly messy. In the end, the implementation of the solution I found was remarkably simple, using the Sweep Line Algorithm, which is an algorithm that is often used in programming interviews and puzzles.

I will show you how I solved it but don’t worry if you don’t understand what is happening. I will share other resources for you to learn more about it later on.
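
Before the SQL version, here is a toy, pure-Python sketch of the sweep line idea (my own illustration, not part of the original analysis): turn each interval into +1/-1 events, sort them, and keep a running counter of how many ads are active.

# Toy sweep line over three (start, end) fractions of a video's runtime.
intervals = [(0.00, 0.05), (0.02, 0.08), (0.50, 0.55)]

events = [(start, +1) for start, end in intervals] + [(end, -1) for start, end in intervals]
events.sort()  # sort by position; ends (-1) sort before starts (+1) at ties

active = 0
for position, delta in events:
    active += delta
    print(f"at {position:.2f} of the runtime, {active} ad(s) running")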

The first thing to do is to transform each interval (startTime, endTime) into two events, one that will count as +1 when the ad starts, and another that will count as -1 when the ad finishes. Afterward, just order the dataset by the “start time”.

CREATE TABLE ad_events AS
WITH unioned as (
  -- This is the most important step.
  SELECT
      videoID,
      start_fraction as fraction,
      1 as delta
  FROM ad_intervals
  UNION ALL
  SELECT
      videoID,
      end_fraction as fraction,
      -1 as delta
  FROM ad_intervals
), ordered AS (
  SELECT
      videoID,
      fraction,
      delta
  FROM unioned -- select from the CTE above, not from the table being created
  ORDER BY fraction, delta
)
SELECT * FROM ordered

Now it’s already much easier to see the path forward! All I have to do is use a running sum on the delta column, and then, at any point of the dataset, I can know how many ads are running! 

For example, if from 0s to 10s three ads started, but two of those also finished, I would have a delta of +3 and then -2, which means that there is only one ad currently running!

Going forward, and to simplify the data a bit, I first round the fractions to 4 decimal points and aggregate them. This is not necessary, but having too many rows was a problem when trying to plot the data. Finally, I divide the amount of running ads by the total amount of videos, to have it as a percentage.

CREATE TABLE ad_counter AS 
WITH rounded_and_grouped AS (
  SELECT
      ROUND(fraction, 4) as fraction,
      SUM(delta) as delta
  FROM ad_events
  GROUP BY ROUND(fraction, 4)
  ORDER BY fraction
), running_sum AS (
  SELECT
      fraction,
      SUM(delta) OVER (ORDER BY fraction) as ad_counter
  FROM rounded_and_grouped
), density AS (
  SELECT
      fraction,
      ad_counter,
      ad_counter / (SELECT COUNT(DISTINCT videoID) FROM unique_segments) as density
  FROM running_sum
)
SELECT * FROM density

With this data I know not only that 69,987 videos are running an ad right at the very start (the 0.0% fraction), but also that this represents 17% of all videos in the dataset.

Now I can finally plot it as a heatmap:

As expected, the bumps at the extremities show that it’s way more common for channels to run ads at the beginning and end of the video. It’s also interesting that there is a plateau around the middle of the video, but then a drop, as the second half of the video is generally more ad-free.

What I found funny is that it’s apparently common for some videos to start straight away with an ad. I couldn’t picture this, so I manually checked 10 videos and it’s actually true… I’m not sure how representative it is, but most of the ones that I opened were gaming-related and in Russian, and they started directly with ads!

Before we move on to the conclusions, what did you think of the solution to this problem? I was surprised at how simple doing this was with the Sweep Line trick. If you want to know more about it, I recently published a blog post covering some SQL patterns, and the last one is exactly this problem, just repackaged in the context of counting concurrent meetings.

Conclusion

I really enjoyed doing this analysis since the data feels very personal to me, especially because I’ve been addicted to YouTube lately. I also feel that the answers I found were quite satisfactory, at least for the most part. To finish it off, let’s do a last recap!

Have Sponsor Segments Increased Over the Years?

There was a clear increase from 2020 to 2021. This was an effect that happened throughout all digital media and it’s clearly shown in this data. In more recent years, I can’t say whether there was an increase or not, as I don’t have enough data to be confident. 

Which Channels Have the Highest Percentage of Sponsor Time Per Video?

I got to create a very convincing list of the Top 50 channels that run the highest share of ads. And I discovered that some of my favorite creators are the ones that spend the most time trying to sell me something!

What is the density of sponsor segments throughout a video?

As expected, most people run ads at the beginning and the end of videos. Besides this, a lot of creators run ads around the middle of the video, making the second half slightly more ad-free. 

Also, there are YouTubers who immediately start a video with ads, which I think is a crazy strategy. 

Other Learnings and Next Steps

I liked how clearly the data showed the percentage of ads across different video sizes. Now I know that I'm probably spending 5–6% of my time on YouTube watching ads if I'm not skipping them, since I mostly watch videos that are 10–20 minutes long.

I’m still not fully happy though with the year-by-year analysis. I’ve already looked into other data and downloaded more than 100 GB of YouTube metadata datasets. I’m confident that I can use it, together with the YouTube API, to fill some gaps and get a more convincing answer to my question.

Visualization Code

You might have noticed that I didn’t provide snippets to plot the charts shown here. This was on purpose to make the blog post more readable, as matplotlib code occupies a lot of space.

You can find all the code in my GitHub repo, that way you can copy my charts if you want to.


That’s it for this one! I really hope you enjoyed reading this blog post and learned something new!

If you’re curious about interesting topics that didn’t make it into this post, or enjoy learning about data, subscribe to my free newsletter on Substack. I publish whenever I have something genuinely interesting to share.

Want to connect directly or have questions? Reach out anytime at mtrentz.com.

All images and animations by the author unless stated otherwise.

The post Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data appeared first on Towards Data Science.

]]>