Are You Sure Your Posterior Makes Sense?

A detailed guide on how to use diagnostics to evaluate the performance of MCMC samplers

This article is co-authored by Felipe Bandeira, Giselle Fretta, Thu Than, and Elbion Redenica. We also thank Prof. Carl Scheffler for his support.

Introduction

Parameter estimation has been one of the most important topics in statistics for decades. While frequentist approaches, such as Maximum Likelihood Estimation, used to be the gold standard, the advance of computation has opened space for Bayesian methods. Estimating posterior distributions with MCMC samplers became increasingly common, but reliable inferences depend on a task that is far from trivial: making sure that the sampler — and the processes it executes under the hood — worked as expected. It is worth keeping in mind what Lewis Carroll once wrote: “If you don’t know where you’re going, any road will take you there.”

This article is meant to help data scientists evaluate an often overlooked aspect of Bayesian parameter estimation: the reliability of the sampling process. Throughout the sections, we combine simple analogies with technical rigor to ensure our explanations are accessible to data scientists with any level of familiarity with Bayesian methods. Although our implementations are in Python with PyMC, the concepts we cover are useful to anyone using an MCMC algorithm, from Metropolis-Hastings to NUTS. 

Key Concepts

No data scientist or statistician would disagree with the importance of robust parameter estimation methods. Whether the objective is to make inferences or conduct simulations, having the capacity to model the data generation process is crucial. For a long time, estimations were mainly performed using frequentist tools, such as Maximum Likelihood Estimation (MLE) or even the famous Least Squares optimization used in regressions. Yet, frequentist methods have clear shortcomings, such as the fact that they are focused on point estimates and do not incorporate prior knowledge that could improve estimates.

As an alternative to these tools, Bayesian methods have gained popularity over the past decades. They provide statisticians not only with point estimates of the unknown parameter but also with credible intervals for it, all of which are informed by the data and by the prior knowledge researchers hold. Originally, Bayesian parameter estimation was done through an adapted version of Bayes’ theorem focused on unknown parameters (represented as θ) and known data points (represented as x). We can define P(θ|x), the posterior distribution of a parameter’s value given the data, as:

\[ P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)} \]

In this formula, P(x|θ) is the likelihood of the data given a parameter value, P(θ) is the prior distribution over the parameter, and P(x) is the evidence, which is computed by integrating the joint distribution over all possible parameter values:

\[ P(x) = \int_\theta P(x, \theta) d\theta \]

In some cases, due to the complexity of the calculations required, deriving the posterior distribution analytically was not possible. However, with the advance of computation, running sampling algorithms (especially MCMC ones) to estimate posterior distributions has become easier, giving researchers a powerful tool for situations where analytical posteriors are not trivial to find. Yet, with such power also comes a large amount of responsibility to ensure that results make sense. This is where sampler diagnostics come in, offering a set of valuable tools to gauge 1) whether an MCMC algorithm is working well and, consequently, 2) whether the estimated distribution we see is an accurate representation of the real posterior distribution. But how can we know so?

How samplers work

Before diving into the technicalities of diagnostics, we shall cover how the process of sampling a posterior (especially with an MCMC sampler) works. In simple terms, we can think of a posterior distribution as a geographical area we haven’t been to but need to know the topography of. How can we draw an accurate map of the region?  

One of our favorite analogies comes from Ben Gilbert. Suppose that the unknown region is actually a house whose floorplan we wish to map. For some reason, we cannot directly visit the house, but we can send bees inside with GPS devices attached to them. If everything works as expected, the bees will fly around the house, and using their trajectories, we can estimate what the floor plan looks like. In this analogy, the floor plan is the posterior distribution, and the sampler is the group of bees flying around the house.

The reason we are writing this article is that, in some cases, the bees won’t fly as expected. If they get stuck in a certain room for some reason (because someone dropped sugar on the floor, for example), the data they return won’t be representative of the entire house; rather than visiting all rooms, the bees only visited a few, and our picture of what the house looks like will ultimately be incomplete. Similarly, when a sampler does not work correctly, our estimation of the posterior distribution is also incomplete, and any inference we draw based on it is likely to be wrong.

Markov Chain Monte Carlo (MCMC)

In technical terms, we call an MCMC process any algorithm that undergoes transitions from one state to another with certain properties. Markov Chain refers to the fact that the next state only depends on the current one (or that the bee’s next location is only influenced by its current place, and not by all of the places where it has been before). Monte Carlo means that the next state is chosen randomly. MCMC methods like Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), and No-U-Turn Sampler (NUTS) all operate by constructing Markov Chains (a sequence of steps) that are close to random and gradually explore the posterior distribution.

Now that you understand how a sampler works, let’s dive into a practical scenario to help us explore sampling problems.

Case Study

Imagine that, in a faraway nation, a governor wants to understand more about public annual spending on healthcare by mayors of cities with fewer than 1 million inhabitants. Rather than looking at sheer frequencies, he wants to understand the underlying distribution explaining expenditure, and a sample of spending data is about to arrive. The problem is that two of the economists involved in the project disagree about how the model should look.

Model 1

The first economist believes that all cities spend similarly, with some variation around a certain mean. As such, he creates a simple model. Although the specifics of how the economist chose his priors are irrelevant to us, we do need to keep in mind that he is trying to approximate a Normal (unimodal) distribution.

\[
x_i \sim \text{Normal}(\mu, \sigma^2) \text{ i.i.d. for all } i \\
\mu \sim \text{Normal}(10, 2) \\
\sigma^2 \sim \text{Uniform}(0,5)
\]
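To make this concrete, here is a minimal sketch of how Model 1 might be written and sampled in PyMC. The spending array is a hypothetical placeholder for the data that is about to arrive, the variable names m and s_squared are our own, and the second argument of each Normal above is treated as a standard deviation:

import numpy as np
import pymc as pm

# Hypothetical placeholder for the incoming spending data
spending = np.random.default_rng(42).normal(10, 2, size=200)

with pm.Model() as model_1:
    m = pm.Normal("m", mu=10, sigma=2)                     # prior on the mean
    s_squared = pm.Uniform("s_squared", lower=0, upper=5)  # prior on the variance
    pm.Normal("x", mu=m, sigma=pm.math.sqrt(s_squared), observed=spending)

    idata_1 = pm.sample(draws=2000, chains=4, random_seed=42)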

Model 2

The second economist disagrees, arguing that spending is more complex than his colleague believes. He believes that, given ideological differences and budget constraints, there are two kinds of cities: the ones that do their best to spend very little and the ones that are not afraid of spending a lot. As such, he creates a slightly more complex model, using a mixture of normals to reflect his belief that the true distribution is bimodal.

\[
x_i \sim \text{Normal-Mixture}([\omega, 1-\omega], [m_1, m_2], [s_1^2, s_2^2]) \text{ i.i.d. for all } i\\
m_j \sim \text{Normal}(2.3, 0.5^2) \text{ for } j = 1,2 \\
s_j^2 \sim \text{Inverse-Gamma}(1,1) \text{ for } j=1,2 \\
\omega \sim \text{Beta}(1,1)
\]
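A corresponding PyMC sketch for Model 2, reusing the hypothetical spending array from the previous sketch; the names m, s_squared, and w are our own and mirror the parameters shown in the figures later on:

import pymc as pm

with pm.Model() as model_2:
    w = pm.Beta("w", alpha=1, beta=1)                                   # mixture weight
    m = pm.Normal("m", mu=2.3, sigma=0.5, shape=2)                      # component means
    s_squared = pm.InverseGamma("s_squared", alpha=1, beta=1, shape=2)  # component variances
    pm.NormalMixture(
        "x",
        w=pm.math.stack([w, 1 - w]),
        mu=m,
        sigma=pm.math.sqrt(s_squared),
        observed=spending,
    )

    idata_2 = pm.sample(draws=2000, chains=4, random_seed=42)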

After the data arrives, each economist runs an MCMC algorithm to estimate their desired posteriors, which will be a reflection of reality (1) if their assumptions are true and (2) if the sampler worked correctly. The first if, a discussion about assumptions, shall be left to the economists. However, how can they know whether the second if holds? In other words, how can they be sure that the sampler worked correctly and, as a consequence, their posterior estimations are unbiased?

Sampler Diagnostics

To evaluate a sampler’s performance, we can explore a small set of metrics that reflect different parts of the estimation process.

Quantitative Metrics

R-hat (Potential Scale Reduction Factor)

In simple terms, R-hat evaluates whether bees that started at different places have all explored the same rooms at the end of the day. To estimate the posterior, an MCMC algorithm uses multiple chains (or bees) that start at random locations. R-hat is the metric we use to assess the convergence of the chains. It measures whether multiple MCMC chains have mixed well (i.e., if they have sampled the same topography) by comparing the variance of samples within each chain to the variance of the sample means across chains. Intuitively, this means that

\[
\hat{R} = \sqrt{\frac{\text{Variance Between Chains}}{\text{Variance Within Chains}}}
\]

If R-hat is close to 1.0 (or below 1.01), it means that the variance within each chain is very similar to the variance between chains, suggesting that they have converged to the same distribution. In other words, the chains are behaving similarly and are also indistinguishable from one another. This is precisely what we see after sampling the posterior of the first model, shown in the last column of the table below:

Figure 1. Summary statistics of the sampler highlighting ideal R-hats.

The R-hat values from the second model, however, tell a different story. The fact that we have such large R-hat values indicates that, at the end of the sampling process, the different chains had not converged yet. In practice, this means that the distribution they explored and returned was different, or that each bee created a map of a different room of the house. This fundamentally leaves us without a clue of how the pieces connect or what the complete floor plan looks like.

Figure 2. Summary statistics of the sampler showcasing problematic R-hats.

Given our R-hat readouts were large, we know something went wrong with the sampling process in the second model. However, even if the R-hat had turned out within acceptable levels, this does not give us certainty that the sampling process worked. R-hat is just a diagnostic tool, not a guarantee. Sometimes, even if your R-hat readout is lower than 1.01, the sampler might not have properly explored the full posterior. This happens when multiple bees start their exploration in the same room and remain there. Likewise, if you’re using a small number of chains, and if your posterior happens to be multimodal, there is a probability that all chains started in the same mode and failed to explore other peaks. 
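In practice, you rarely compute R-hat by hand. If the two models were sampled with PyMC as in the sketches above, ArviZ reports it directly; a minimal sketch, assuming the InferenceData objects idata_1 and idata_2 from those sketches:

import arviz as az

# R-hat per parameter; values above ~1.01 deserve a closer look
print(az.rhat(idata_1))
print(az.rhat(idata_2))

# The full summary table also includes r_hat as its last column
print(az.summary(idata_1))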

The R-hat readout reflects convergence, not completion. In order to have a more comprehensive idea, we need to check other diagnostic metrics as well.

Effective Sample Size (ESS)

When explaining what MCMC was, we mentioned that “Monte Carlo” refers to the fact that the next state is chosen randomly. This does not necessarily mean that the states are fully independent. Even though the bees choose their next step at random, these steps are still correlated to some extent. If a bee is exploring a living room at time t=0, it will probably still be in the living room at time t=1, even though it is in a different part of the same room. Due to this natural connection between samples, we say these two data points are autocorrelated.

Due to their nature, MCMC methods inherently produce autocorrelated samples, which complicates statistical analysis and requires careful evaluation. In statistical inference, we often assume independent samples to ensure that the estimates of uncertainty are accurate, hence the need for uncorrelated samples. If two data points are too similar to each other, the correlation reduces their effective information content. Mathematically, the formula below represents the autocorrelation function between two time points (t1 and t2) in a random process:

\[
R_{XX}(t_1, t_2) = E[X_{t_1} \overline{X_{t_2}}]
\]

where E is the expected value operator and X-bar is the complex conjugate. In MCMC sampling, this is crucial because high autocorrelation means that new samples don’t teach us anything different from the old ones, effectively reducing the sample size we have. Unsurprisingly, the metric that reflects this is called Effective Sample Size (ESS), and it helps us determine how many truly independent samples we have. 

As hinted previously, the effective sample size accounts for autocorrelation by estimating how many truly independent samples would provide the same information as the autocorrelated samples we have. Mathematically, for a parameter θ, the ESS is defined as:

\[
ESS = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho(\theta)_k}
\]

where n is the total number of samples and ρ(θ)k is the autocorrelation at lag k for parameter θ.

Typically, for ESS readouts, the higher, the better. This is what we see in the readout for the first model. Two common ESS variations are Bulk-ESS, which assesses mixing in the central part of the distribution, and Tail-ESS, which focuses on the efficiency of sampling the distribution’s tails. Both inform us if our model accurately reflects the central tendency and credible intervals.

Figure 3. Summary statistics of the sampler highlighting ideal quantities for ESS bulk and tail.

In contrast, the readouts for the second model are very bad. Typically, we want to see readouts that are at least 1/10 of the total sample size. In this case, given each chain sampled 2000 observations, we should expect ESS readouts of at least 800 (from the total size of 8000 samples across 4 chains of 2000 samples each), which is not what we observe.

Figure 4. Summary statistics of the sampler demonstrating problematic ESS bulk and tail.
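As with R-hat, these readouts come straight out of ArviZ; a minimal sketch, again assuming the idata_1 and idata_2 objects from the earlier model sketches:

import arviz as az

# Bulk- and Tail-ESS per parameter; with 4 chains x 2000 draws = 8000 samples,
# the rule of thumb above suggests aiming for at least ~800
print(az.ess(idata_1, method="bulk"))
print(az.ess(idata_1, method="tail"))
print(az.ess(idata_2, method="bulk"))
print(az.ess(idata_2, method="tail"))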

Visual Diagnostics

Apart from the numerical metrics, our understanding of sampler performance can be deepened through the use of diagnostic plots. The main ones are rank plots, trace plots, and pair plots.

Rank Plots

A rank plot helps us identify whether the different chains have explored all of the posterior distribution. If we once again think of the bee analogy, rank plots tell us which bees explored which parts of the house. Therefore, to evaluate whether the posterior was explored equally by all chains, we observe the shape of the rank plots produced by the sampler. Ideally, we want the distribution of all chains to look roughly uniform, like in the rank plots generated after sampling the first model. Each color below represents a chain (or bee):

Figure 5. Rank plots for parameters ‘m’ and ‘s’ across four MCMC chains. Each bar represents the distribution of rank values for one chain, with ideally uniform ranks indicating good mixing and proper convergence.

Under the hood, a rank plot is produced with a simple sequence of steps. First, we run the sampler and let it sample from the posterior of each parameter. In our case, we are sampling posteriors for parameters m and s of the first model. Then, parameter by parameter, we get all samples from all chains, put them together, and order them from smallest to largest. We then ask ourselves, for each sample, which chain it came from. This allows us to create plots like the ones we see above.
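You do not need to build these plots manually; ArviZ automates the ranking and plotting. A minimal sketch, assuming the idata objects and parameter names from the earlier model sketches:

import arviz as az
import matplotlib.pyplot as plt

# One rank plot per parameter; each colored histogram corresponds to one chain
az.plot_rank(idata_1, var_names=["m", "s_squared"])
az.plot_rank(idata_2)  # the mixture model from the second economist
plt.show()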

In contrast, bad rank plots are easy to spot. Unlike the previous example, the distributions from the second model, shown below, are not uniform. From the plots, what we interpret is that each chain, after beginning at different random locations, got stuck in a region and did not explore the entirety of the posterior. Consequently, we cannot make inferences from the results, as they are unreliable and not representative of the true posterior distribution. This would be equivalent to having four bees that started at different rooms of the house and got stuck somewhere during their exploration, never covering the entirety of the property.

Figure 6. Rank plots for parameters m, s_squared, and w across four MCMC chains. Each subplot shows the distribution of ranks by chain. There are noticeable deviations from uniformity (e.g., stair-step patterns or imbalances across chains) suggesting potential sampling issues.

KDE and Trace Plots

Similar to R-hat, trace plots help us assess the convergence of MCMC samples by visualizing how the algorithm explores the parameter space over time. PyMC provides two types of trace plots to diagnose mixing issues: Kernel Density Estimate (KDE) plots and iteration-based trace plots. Each of these serves a distinct purpose in evaluating whether the sampler has properly explored the target distribution.

The KDE plot (usually on the left) estimates the posterior density for each chain, where each line represents a separate chain. This allows us to check whether all chains have converged to the same distribution. If the KDEs overlap, it suggests that the chains are sampling from the same posterior and that mixing has occurred. On the other hand, the trace plot (usually on the right) visualizes how parameter values change over MCMC iterations (steps), with each line representing a different chain. A well-mixed sampler will produce trace plots that look noisy and random, with no clear structure or separation between chains.

Using the bee analogy, trace plots can be thought of as snapshots of the “features” of the house at different locations. If the sampler is working correctly, the KDEs in the left plot should align closely, showing that all bees (chains) have explored the house similarly. Meanwhile, the right plot should show highly variable traces that blend together, confirming that the chains are actively moving through the space rather than getting stuck in specific regions.
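Generating these plots is a one-liner per model; a minimal sketch, assuming the idata objects from the earlier sketches:

import arviz as az
import matplotlib.pyplot as plt

# Left: KDE of the posterior per chain. Right: parameter value per iteration.
az.plot_trace(idata_1)  # the well-behaved model
az.plot_trace(idata_2)  # the mixture model with suspected mixing issues
plt.show()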

Figure 7. Density and trace plots for parameters m and s from the first model across four MCMC chains. The left panel shows kernel density estimates (KDE) of the marginal posterior distribution for each chain, indicating consistent central tendency and spread. The right panel displays the trace plot over iterations, with overlapping chains and no apparent divergences, suggesting good mixing and convergence.

However, if your sampler has poor mixing or convergence issues, you will see something like the figure below. In this case, the KDEs will not overlap, meaning that different chains have sampled from different distributions rather than a shared posterior. The trace plot will also show structured patterns instead of random noise, indicating that chains are stuck in different regions of the parameter space and failing to fully explore it.

Figure 8. KDE (left) and trace plots (right) for parameters m, s_squared, and w across MCMC chains for the second model. Multimodal distributions are visible for m and w, suggesting potential identifiability issues. Trace plots reveal that chains explore different modes with limited mixing, particularly for m, highlighting challenges in convergence and effective sampling.

By using trace plots alongside the other diagnostics, you can identify sampling issues and determine whether your MCMC algorithm is effectively exploring the posterior distribution.

Pair Plots

A third kind of plot that is often useful for diagnostics is the pair plot. In models where we want to estimate the posterior distribution of multiple parameters, pair plots allow us to observe how different parameters are correlated. To understand how such plots are formed, think again about the bee analogy. If you imagine that we’ll create a plot with the width and length of the house, each “step” that the bees take can be represented by an (x, y) combination. Likewise, each parameter of the posterior is represented as a dimension, and we create scatter plots showing where the sampler walked using parameter values as coordinates. Here, we are plotting each unique pair (x, y), resulting in the scatter plot you see in the middle of the image below. The one-dimensional plots you see on the edges are the marginal distributions over each parameter, giving us additional information on the sampler’s behavior when exploring them.
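Pair plots are equally easy to produce; a minimal sketch, assuming the idata objects from the earlier sketches:

import arviz as az
import matplotlib.pyplot as plt

# Scatter plot for every parameter pair, with marginal densities on the edges
az.plot_pair(idata_1, marginals=True)
az.plot_pair(idata_2, marginals=True)
plt.show()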

Take a look at the pair plot from the first model.

Figure 9. Joint posterior distribution of parameters m and s, with marginal densities. The scatter plot shows a roughly symmetric, elliptical shape, suggesting a low correlation between m and s.

Each axis represents one of the two parameters whose posteriors we are estimating. For now, let’s focus on the scatter plot in the middle, which shows the parameter combinations sampled from the posterior. The fact we have a very even distribution means that, for any particular value of m, there was a range of values of s that were equally likely to be sampled. Additionally, we don’t see any correlation between the two parameters, which is usually good! There are cases when we would expect some correlation, such as when our model involves a regression line. However, in this instance, we have no reason to believe two parameters should be highly correlated, so the fact we don’t observe unusual behavior is positive news. 

Now, take a look at the pair plots from the second model.

Figure 10. Pair plot of the joint posterior distributions for parameters m, s_squared, and w. The scatter plots reveal strong correlations between several parameters.

Given that this model has five parameters to be estimated, we naturally have a greater number of plots since we are analyzing them pair-wise. However, they look odd compared to the previous example. Namely, rather than having an even distribution of points, the samples here either seem to be divided across two regions or seem somewhat correlated. This is another way of visualizing what the rank plots have shown: the sampler did not explore the full posterior distribution. Below we isolated the top left plot, which contains the samples from m0 and m1. Unlike the plot from model 1, here we see that the value of one parameter greatly influences the value of the other. If we sampled m1 around 2.5, for example, m0 is likely to be sampled from a very narrow range around 1.5.

Figure 11. Joint posterior distribution of parameters m₀ and m₁, with marginal densities.

Certain shapes can be observed in problematic pair plots relatively frequently. Diagonal patterns, for example, indicate a high correlation between parameters. Banana shapes are often connected to parametrization issues, often being present in models with tight priors or constrained parameters. Funnel shapes might indicate hierarchical models with bad geometry. When we have two separate islands, like in the plot above, this can indicate that the posterior is bimodal AND that the chains haven’t mixed well. However, keep in mind that these shapes might indicate problems, but not necessarily do so. It’s up to the data scientist to examine the model and determine which behaviors are expected and which ones are not!

Some Fixing Techniques

When your diagnostics indicate sampling problems — whether concerning R-hat values, low ESS, unusual rank plots, separated trace plots, or strange parameter correlations in pair plots — several strategies can help you address the underlying issues. Sampling problems typically stem from the target posterior being too complex for the sampler to explore efficiently. Complex target distributions might have:

  • Multiple modes (peaks) that the sampler struggles to move between
  • Irregular shapes with narrow “corridors” connecting different regions
  • Areas of drastically different scales (like the “neck” of a funnel)
  • Heavy tails that are difficult to sample accurately

In the bee analogy, these complexities represent houses with unusual floor plans — disconnected rooms, extremely narrow hallways, or areas that change dramatically in size. Just as bees might get trapped in specific regions of such houses, MCMC chains can get stuck in certain areas of the posterior.

Figure 12. Examples of multimodal target distributions.
Figure 13. Examples of weirdly shaped distributions.

To help the sampler in its exploration, there are simple strategies we can use.

Strategy 1: Reparameterization

Reparameterization is particularly effective for hierarchical models and distributions with challenging geometries. It involves transforming your model’s parameters to make them easier to sample. Back to the bee analogy, imagine the bees are exploring a house with a peculiar layout: a spacious living room that connects to the kitchen through a very, very narrow hallway. One aspect we hadn’t mentioned before is that the bees have to fly in the same way through the entire house. That means that if we dictate the bees should use large “steps,” they will explore the living room very well but hit the walls in the hallway head-on. Likewise, if their steps are small, they will explore the narrow hallway well, but take forever to cover the entire living room. The difference in scales, which is natural to the house, makes the bees’ job more difficult.

A classic example that represents this scenario is Neal’s funnel, where the scale of one parameter depends on another:

\[
p(y, x) = \text{Normal}(y|0, 3) \times \prod_{n=1}^{9} \text{Normal}(x_n|0, e^{y/2})
\]

Figure 14. The log marginal density of y and the first dimension of Neal’s funnel. The neck is where the sampler struggles to sample from, and the step size required there is much smaller than in the body. (Image source: Stan User’s Guide)

We can see that the scale of x is dependent on the value of y. To fix this problem, we can separate x and y as independent standard Normals and then transform these variables into the desired funnel distribution. Instead of sampling directly like this:

\[
\begin{align*}
y &\sim \text{Normal}(0, 3) \\
x &\sim \text{Normal}(0, e^{y/2})
\end{align*}
\]

You can reparameterize to sample from standard Normals first:

\[
\begin{align*}
y_{raw} &\sim \text{Normal}(0, 1) \\
x_{raw} &\sim \text{Normal}(0, 1) \\
y &= 3y_{raw} \\
x &= e^{y/2} x_{raw}
\end{align*}
\]

This technique separates the hierarchical parameters and makes sampling more efficient by eliminating the dependency between them. 
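In PyMC, the two parameterizations of Neal’s funnel might look like the sketch below; the centered version samples x directly, while the non-centered version samples standard Normals and recovers y and x as deterministic transforms (model and variable names here are our own):

import pymc as pm

# Centered: the scale of x depends directly on y, which strains the sampler
with pm.Model() as centered_funnel:
    y = pm.Normal("y", mu=0, sigma=3)
    x = pm.Normal("x", mu=0, sigma=pm.math.exp(y / 2), shape=9)
    idata_centered = pm.sample(draws=2000, chains=4)

# Non-centered: sample independent standard Normals, then transform
with pm.Model() as noncentered_funnel:
    y_raw = pm.Normal("y_raw", mu=0, sigma=1)
    x_raw = pm.Normal("x_raw", mu=0, sigma=1, shape=9)
    y = pm.Deterministic("y", 3 * y_raw)
    x = pm.Deterministic("x", pm.math.exp(y / 2) * x_raw)
    idata_noncentered = pm.sample(draws=2000, chains=4)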

Reparameterization is like redesigning the house such that instead of forcing the bees to find a single narrow hallway, we create a new layout where all passages have similar widths. This helps the bees use a consistent flying pattern throughout their exploration.

Strategy 2: Handling Heavy-tailed Distributions

Heavy-tailed distributions like Cauchy and Student-T present challenges for samplers and the ideal step size. Their tails require larger step sizes than their central regions (similar to very long hallways that require the bees to travel long distances), which creates a challenge:

  • Small step sizes lead to inefficient sampling in the tails
  • Large step sizes cause too many rejections in the center
Figure 15. Probability density functions for various Cauchy distributions illustrate the effects of changing the location parameter and scale parameter. (Image source: Wikipedia)

Reparameterization solutions include (a sketch follows the list):

  • For Cauchy: Defining the variable as a transformation of a Uniform distribution using the Cauchy inverse CDF
  • For Student-T: Using a Gamma-Mixture representation
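The sketch below illustrates both ideas in PyMC; the location and scale values and the degrees of freedom nu are arbitrary choices for illustration:

import numpy as np
import pymc as pm

with pm.Model() as heavy_tail_reparam:
    # Cauchy(location=0, scale=1) as a transform of a Uniform draw
    # through the Cauchy inverse CDF
    u = pm.Uniform("u", lower=0, upper=1)
    cauchy_x = pm.Deterministic("cauchy_x", 0 + 1 * pm.math.tan(np.pi * (u - 0.5)))

    # Student-T with nu degrees of freedom as a Gamma mixture of Normals:
    # tau ~ Gamma(nu/2, nu/2), z ~ Normal(0, 1), then z / sqrt(tau) ~ StudentT(nu)
    nu = 4
    tau = pm.Gamma("tau", alpha=nu / 2, beta=nu / 2)
    z = pm.Normal("z", mu=0, sigma=1)
    student_x = pm.Deterministic("student_x", z / pm.math.sqrt(tau))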

Strategy 3: Hyperparameter Tuning

Sometimes the solution lies in adjusting the sampler’s hyperparameters (a PyMC sketch follows the list):

  • Increase total iterations: The simplest approach — give the sampler more time to explore.
  • Increase target acceptance rate (adapt_delta): Reduce divergent transitions (try 0.9 instead of the default 0.8 for complex models, for example).
  • Increase max_treedepth: Allow the sampler to take more steps per iteration.
  • Extend warmup/adaptation phase: Give the sampler more time to adapt to the posterior geometry.
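In PyMC, most of these knobs are arguments to pm.sample, where target_accept plays the role of Stan’s adapt_delta and the tree depth is set on the NUTS step object. A minimal sketch, reusing the mixture model context from the earlier sketches:

import pymc as pm

with model_2:  # the mixture model defined earlier
    idata_2_tuned = pm.sample(
        draws=4000,          # more iterations
        tune=2000,           # longer warmup/adaptation phase
        chains=4,
        target_accept=0.95,  # higher acceptance target for tricky geometries
        random_seed=42,
        # To raise the tree depth as well, pass an explicit step object instead:
        # step=pm.NUTS(target_accept=0.95, max_treedepth=12)
    )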

Remember that while these adjustments may improve your diagnostic metrics, they often treat symptoms rather than underlying causes. The previous strategies (reparameterization and better proposal distributions) typically offer more fundamental solutions.

Strategy 4: Better Proposal Distributions

This strategy concerns samplers that rely on an explicit proposal step, such as Metropolis-Hastings, rather than gradient-based methods. It basically asks the question: “I’m currently here in this landscape. Where should I jump to next so that I explore the full landscape, or how do I know that the next jump is the jump I should make?” Thus, choosing a good proposal distribution means making sure that the sampling process explores the full parameter space instead of just a specific region. A good proposal distribution should:

  1. Have substantial probability mass where the target distribution does.
  2. Allow the sampler to make jumps of the appropriate size.

One common choice of the proposal distribution is the Gaussian (Normal) distribution with mean μ and standard deviation σ — the scale of the distribution that we can tune to decide how far to jump from the current position to the next position. If we choose the scale for the proposal distribution to be too small, it might either take too long to explore the entire posterior or it will get stuck in a region and never explore the full distribution. But if the scale is too large, you might never get to explore some regions, jumping over them. It’s like playing ping-pong where we only reach the two edges but not the middle.
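To see how the proposal scale changes behavior, here is a small, self-contained random-walk Metropolis-Hastings sketch with a Gaussian proposal, run against a standard Normal target; the acceptance rate hints at whether the scale is too timid or too aggressive (all names and values here are illustrative):

import numpy as np

def metropolis_hastings(log_target, initial, n_samples=5000, proposal_scale=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a Gaussian proposal of width proposal_scale."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current = initial
    current_logp = log_target(current)
    accepted = 0

    for i in range(n_samples):
        proposal = current + rng.normal(0.0, proposal_scale)  # Gaussian jump
        proposal_logp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current))
        if np.log(rng.uniform()) < proposal_logp - current_logp:
            current, current_logp = proposal, proposal_logp
            accepted += 1
        samples[i] = current

    return samples, accepted / n_samples

# Standard Normal target: log density up to a constant
log_std_normal = lambda x: -0.5 * x**2

for scale in (0.1, 2.5, 50.0):
    _, acc_rate = metropolis_hastings(log_std_normal, initial=0.0, proposal_scale=scale)
    print(f"proposal scale {scale:>5}: acceptance rate {acc_rate:.2f}")

With a tiny scale nearly every proposal is accepted but the chain barely moves; with a huge scale almost every proposal is rejected; a moderate scale balances the two.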

Improve Prior Specification

When all else fails, reconsider your model’s prior specifications. Vague or weakly informative priors (like uniformly distributed priors) can sometimes lead to sampling difficulties. More informative priors, when justified by domain knowledge, can help guide the sampler toward more reasonable regions of the parameter space. Sometimes, despite your best efforts, a model may remain challenging to sample effectively. In such cases, consider whether a simpler model might achieve similar inferential goals while being more computationally tractable. The best model is often not the most complex one, but the one that balances complexity with reliability. The table below shows the summary of fixing strategies for different issues.

Diagnostic Signal | Potential Issue | Recommended Fix
High R-hat | Poor mixing between chains | Increase iterations, adjust the step size
Low ESS | High autocorrelation | Reparameterization, increase adapt_delta
Non-uniform rank plots | Chains stuck in different regions | Better proposal distribution, start with multiple chains
Separated KDEs in trace plots | Chains exploring different distributions | Reparameterization
Funnel shapes in pair plots | Hierarchical model issues | Non-centered reparameterization
Disjoint clusters in pair plots | Multimodality with poor mixing | Adjusted distribution, simulated annealing

Conclusion

Assessing the quality of MCMC sampling is crucial for ensuring reliable inference. In this article, we explored key diagnostic metrics such as R-hat, ESS, rank plots, trace plots, and pair plots, discussing how each helps determine whether the sampler is performing properly.

If there’s one takeaway we want you to keep in mind it’s that you should always run diagnostics before drawing conclusions from your samples. No single metric provides a definitive answer — each serves as a tool that highlights potential issues rather than proving convergence. When problems arise, strategies such as reparameterization, hyperparameter tuning, and prior specification can help improve sampling efficiency.

By combining these diagnostics with thoughtful modeling decisions, you can ensure a more robust analysis, reducing the risk of misleading inferences due to poor sampling behavior.

References

B. Gilbert, Bob’s bees: the importance of using multiple bees (chains) to judge MCMC convergence (2018), Youtube

Chi-Feng, MCMC demo (n.d.), GitHub

D. Simpson, Maybe it’s time to let the old ways die; or We broke R-hat so now we have to fix it. (2019), Statistical Modeling, Causal Inference, and Social Science

M. Taboga, Markov Chain Monte Carlo (MCMC) methods (2021), Lectures on Probability Theory and Mathematical Statistics, Kindle Direct Publishing.

T. Wiecki, MCMC Sampling for Dummies (2024), twiecki.io

Stan User’s Guide, Reparameterization (n.d.), Stan Documentation

The Invisible Revolution: How Vectors Are (Re)defining Business Success

The hidden force behind AI is powering the next wave of business transformation

In a world that increasingly revolves around data, business leaders must understand vector thinking. At first, vectors may appear as complicated as algebra once did in school, but they serve as a fundamental building block. Just as algebra is essential for everyday tasks like splitting a bill or computing interest, vectors are essential to the digital systems that drive decision making, customer engagement, and data protection.

They represent a radically different concept of relationships and patterns. They do not simply divide data into rigid categories. Instead, they offer a dynamic, multidimensional view of the underlying connections. “Similar”, for two customers, may mean more than shared demographics or purchase histories; it’s their behaviors, preferences, and habits that distinctly align. Such associations can be defined and measured accurately in a vector space. But for many modern businesses, the logic feels too complex, so leaders tend to fall back on old, learned, rule-based patterns instead. Fraud detection, for example, used to rely on simple rules about transaction limits; we’ve since evolved to recognize patterns and anomalies.

While it might have been common to block transactions that allocate 50% of your credit card limit at once just a few years ago, we are now able to analyze your retailer-specific spend history, look at average baskets of other customers at the very same retailers, and do some slight logic checks such as the physical location of your previous spends.

So a $7,000 transaction for McDonald’s in Dubai might just not happen if you just spent $3 on a bike rental in Amsterdam. Even $20 wouldn’t work since logical vector patterns can rule out the physical distance to be valid. Instead, the $7,000 transaction for your new E-Bike at a retailer near Amsterdam’s city center may just work flawlessly. Welcome to the insight of living in a world managed by vectors.

The danger of ignoring the paradigm of vectors is huge. Not mastering algebra can lead to bad financial decisions. Similarly, not knowing vectors can leave you vulnerable as a business leader. While the average customer may stay unaware of vectors as much as an average passenger in a plane is of aerodynamics, a business leader should be at least aware of what kerosene is and how many seats are to be occupied to break even for a specific flight. You may not need to fully understand the systems you rely on. A basic understanding helps to know when to reach out to the experts. And this is exactly my aim in this little journey into the world of vectors: become aware of the basic principles and know when to ask for more to better steer and manage your business.

In the hushed hallways of research labs and tech companies, a revolution was brewing. It would change how computers understood the world. This revolution had nothing to do with processing power or storage capacity. It was all about teaching machines to understand context, meaning, and nuance in words, using mathematical representations called vectors. Before we can appreciate the magnitude of this shift, we first need to understand what it differs from.

Think about the way humans take in information. When we look at a cat, we don’t just process a checklist of components: whiskers, fur, four legs. Instead, our brains work through a network of relationships, contexts, and associations. We know a cat is more like a lion than a bicycle, not from memorizing this fact, but because our brains have naturally learned these relationships. Vector representations let computers consume content in a similarly human-like way, and we ought to understand how and why this is true. It’s as fundamental as knowing algebra in the time of an impending AI revolution.

In this brief jaunt in the vector realm, I will explain how vector-based computing works and why it’s so transformative. The code examples are only examples, so they are just for illustration and have no stand-alone functionality. You don’t have to be an engineer to understand those concepts. All you have to do is follow along, as I walk you through examples with plain language commentary explaining each one step by step, one step at a time. I don’t aim to be a world-class mathematician. I want to make vectors understandable to everyone: business leaders, managers, engineers, musicians, and others.


What are vectors, anyway?

Photo by Pete F on Unsplash

The vector-based computing journey did not start recently. Its roots go back to the 1950s with the development of distributed representations in cognitive science. James McClelland and David Rumelhart, among other researchers, theorized that the brain holds concepts not as individual entities but as the combined activity patterns of neural networks. This insight paved the way for contemporary vector representations.

The real breakthrough was three things coming together:
The exponential growth in computational power, the development of sophisticated neural network architectures, and the availability of massive datasets for training.

It is the combination of these elements that makes vector-based systems theoretically possible and practically implementable at scale. Mainstream AI as people have come to know it (with the likes of ChatGPT and others) is a direct consequence of this.

To better understand, let me put this in context: conventional computing systems work on symbols — discrete, human-readable symbols and rules. A traditional system, for instance, might represent a customer as a record:

customer = {
    'id': '12345',
    'age': 34,
    'purchase_history': ['electronics', 'books'],
    'risk_level': 'low'
}

This representation may be readable or logical, but it misses subtle patterns and relationships. In contrast, vector representations encode information within high-dimensional space where relationships arise naturally through geometric proximity. That same customer might be represented as a 384-dimensional vector where each one of these dimensions contributes to a rich, nuanced profile. Simple code allows for 2-Dimensional customer data to be transformed into vectors. Let’s take a look at how simple this just is:

from sentence_transformers import SentenceTransformer
import numpy as np

class CustomerVectorization:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        
    def create_customer_vector(self, customer_data):
        """
        Transform customer data into a rich vector representation
        that captures subtle patterns and relationships
        """
        # Combine various customer attributes into a meaningful text representation
        customer_text = f"""
        Customer profile: {customer_data['age']} year old,
        interested in {', '.join(customer_data['purchase_history'])},
        risk level: {customer_data['risk_level']}
        """
        
        # Generate base vector from text description
        base_vector = self.model.encode(customer_text)
        
        # Enrich vector with numerical features
        numerical_features = np.array([
            customer_data['age'] / 100,  # Normalized age
            len(customer_data['purchase_history']) / 10,  # Purchase history length
            self._risk_level_to_numeric(customer_data['risk_level'])
        ])
        
        # Combine text-based and numerical features
        combined_vector = np.concatenate([
            base_vector,
            numerical_features
        ])
        
        return combined_vector
    
    def _risk_level_to_numeric(self, risk_level):
        """Convert categorical risk level to normalized numeric value"""
        risk_mapping = {'low': 0.1, 'medium': 0.5, 'high': 0.9}
        return risk_mapping.get(risk_level.lower(), 0.5)

I trust that this code example has helped demonstrate how easily complex customer data can be encoded into meaningful vectors. The method seems complex at first. But, it is simple. We merge text and numerical data on customers. This gives us rich, info-dense vectors that capture each customer’s essence. What I love most about this technique is its simplicity and flexibility. Similar to how we encoded age, purchase history, and risk levels here, you could replicate this pattern to capture any other customer attributes that boil down to the relevant base case for your use case. Just recall the credit card spending patterns we described earlier. It’s similar data being turned into vectors, giving it a meaning far greater than it could ever have had it stayed 2-dimensional and been used for traditional rule-based logic.
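A minimal usage sketch, feeding the earlier example record into the class (this assumes the ‘all-MiniLM-L6-v2’ model can be downloaded or is already cached locally):

vectorizer = CustomerVectorization()

customer = {
    'id': '12345',
    'age': 34,
    'purchase_history': ['electronics', 'books'],
    'risk_level': 'low'
}

vector = vectorizer.create_customer_vector(customer)
print(vector.shape)  # 384 text-embedding dimensions + 3 numerical features = (387,)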

What our little code example allowed us to do is combine two very suggestive representations, one in a semantically rich embedding space and one in a normalized value space, mapping every record to a point in a space with direct comparison properties.

This allows the systems to identify complex patterns and relations that traditional data structures won’t be able to reflect adequately. With the geometric nature of vector spaces, the shape of these structures tells the stories of similarities, differences, and relationships, allowing for an inherently standardized yet flexible representation of complex data. 

But going from here, you will see this structure copied across other applications of vector-based customer analysis: take relevant data, aggregate it into a format we can work with, and let the combined representation merge heterogeneous data into a common vector-based understanding. Whether it’s recommendation systems, customer segmentation models, or predictive analytics tools, this fundamental approach to thoughtful vectorization will underpin all of it. Thus, it is significant to know and understand, even if you consider yourself non-technical and more on the business side.

Just keep in mind — the key is considering what part of your data has meaningful signals and how to encode them in a way that preserves their relationships. It is nothing more than following your business logic in a different way of thinking than algebra. A more modern, multi-dimensional way.


The Mathematics of Meaning (Kings and Queens)

Photo by Debbie Fan on Unsplash

All human communication delivers rich networks of meaning that our brains wire to make sense of automatically. These are meanings that we can capture mathematically, using vector-based computing; we can represent words in space so that they are points in a multi-dimensional word space. This geometrical treatment allows us to think in spatial terms about the abstract semantic relations we are interested in, as distances and directions.

For instance, the relationship “King is to Queen as Man is to Woman” is encoded in a vector space in such a way that the direction and distance between the words “King” and “Queen” are similar to those between the words “Man” and “Woman.”

Let’s take a step back to understand why this might be: the key component that makes this system work is word embeddings — numerical representations that encode words as vectors in a dense vector space. These embeddings are derived from examining co-occurrences of words across large snippets of text. Just as we learn that “dog” and “puppy” are related concepts by observing that they occur in similar contexts, embedding algorithms learn to embed these words close to each other in a vector space.

Word embeddings reveal their real power when we look at how they encode analogical relationships. Think about what we know about the relationship between “king” and “queen.” We can tell through intuition that these words are different in gender but share associations related to the palace, authority, and leadership. Through a wonderful property of vector space systems — vector arithmetic — this relationship can be captured mathematically.

One does this beautifully in the classic example:

vector('king') - vector('man') + vector('woman') ≈ vector('queen')

This equation tells us that if we have the vector for “king,” and we subtract out the “man” vector (we remove the concept of “male”), and then we add the “woman” vector (we add the concept of “female”), we get a new point in space very close to that of “queen.” That’s not some mathematical coincidence — it’s based on how the embedding space has arranged the meaning in a sort of structured way.

We can apply this idea of context in Python with pre-trained word embeddings:

import gensim.downloader as api

# Load a pre-trained model that contains word vectors learned from Google News
model = api.load('word2vec-google-news-300')

# Define our analogy words
source_pair = ('king', 'man')
target_word = 'woman'

# Find which word completes the analogy using vector arithmetic
result = model.most_similar(
    positive=[target_word, source_pair[0]], 
    negative=[source_pair[1]], 
    topn=1
)

# Display the result
print(f"{source_pair[0]} is to {source_pair[1]} as {target_word} is to {result[0][0]}")

The structure of this vector space exposes many basic principles:

  1. Semantic similarity is present as spatial proximity. Related words congregate: the neighborhoods of ideas. “Dog,” “puppy,” and “canine” would be one such cluster; meanwhile, “cat,” “kitten,” and “feline” would create another cluster nearby.
  2. Relationships between words become directions in the space. The vector from “man” to “woman” encodes a gender relationship, and other such relationships (for example, “king” to “queen” or “actor” to “actress”) typically point in the same direction.
  3. The magnitude of vectors can carry meaning about word importance or specificity. Common words often have shorter vectors than specialized terms, reflecting their broader, less specific meanings.

Working with relationships between words in this way gave us a geometric encoding of meaning and the mathematical precision needed to reflect the nuances of natural language processing to machines. Instead of treating words as separate symbols, vector-like systems can recognize patterns, make analogies, and even uncover relationships that were never programmed.

To better grasp what was just discussed, I took the liberty of mapping the words we mentioned before (“King, Man, Woman”; “Dog, Puppy, Canine”; “Cat, Kitten, Feline”) to corresponding 2D vectors. These vectors numerically represent semantic meaning.

Visualization of the before-mentioned example terms as 2D word embeddings. Showing grouped categories for explanatory purposes. Data is fabricated and axes are simplified for educational purposes.
  • Human-related words have high positive values on both dimensions.
  • Dog-related words have negative x-values and positive y-values.
  • Cat-related words have positive x-values and negative y-values.

Be aware, those values are fabricated by me for illustration. As shown in the 2D space where the vectors are plotted, you can observe groups based on the positions of the dots representing the vectors. The three dog-related words, for example, can be clustered as the “Dog” category, and so on.

Grasping these basic principles gives us insight into both the capabilities and limitations of modern language AI, such as large language models (LLMs). Though these systems can do amazing analogical and relational gymnastics, they ultimately operate on geometric patterns derived from the ways that words appear in proximity to one another in a body of text. An elaborate but, by definition, partial reflection of human linguistic comprehension. As such, an LLM, since it is based on vectors, can only generate as output what it has received as input. That doesn’t mean it generates only what it has been trained on 1:1 (we all know about the fantastic hallucination capabilities of LLMs); it means that LLMs, unless specifically instructed, won’t come up with neologisms or new language to describe things. This basic understanding is still lacking among many business leaders who, unaware of the underlying principles of vectors, expect LLMs to be miracle machines.


A Tale of Distances, Angles, and Dinner Parties

Photo by OurWhisky Foundation on Unsplash

Now, let’s assume you’re throwing a dinner party and it’s all about Hollywood and the big movies, and you want to seat people based on what they like. You could just calculate “distance” between their preferences (genres, perhaps even hobbies?) and find out who should sit together. But deciding how you measure that distance can be the difference between compelling conversations and annoyed participants. Or awkward silences. And yes, that company party flashback is repeating itself. Sorry for that!

The same is true in the world of vectors. The distance metric defines how “similar” two vectors look, and therefore, ultimately, how well your system performs to predict an outcome.

Euclidean Distance: Straightforward, but Limited

Euclidean distance measures the straight-line distance between two points in space, making it easy to understand:

  • Euclidean distance is fine as long as vectors are physical locations.
  • However, in high-dimensional spaces (like vectors representing user behavior or preferences), this metric often falls short. Differences in scale or magnitude can skew results, focusing on scale over actual similarity.

Example: Two vectors might represent your dinner guests’ preferences for how much streaming services are used:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

While their preferences align, Euclidean distance would make them seem vastly different because of the disparity in overall activity.

But in higher-dimensional spaces, such as user behavior or textual meaning, Euclidean distance becomes increasingly less informative. It overweights magnitude, which can obscure comparisons. Consider two moviegoers: one has seen 200 action movies, the other has seen 10, but they both like the same genres. Because of their sheer activity level, the second viewer would appear much less similar to the first when using Euclidean distance, even though all they ever watched were Bruce Willis movies.

Cosine Similarity: Focused on Direction

The cosine similarity method takes a different approach. It focuses on the angle between vectors, not their magnitudes. It’s like comparing the path of two arrows. If they point the same way, they are aligned, no matter their lengths. This shows that it’s perfect for high-dimensional data, where we care about relationships, not scale.

  • If two vectors point in the same direction, they’re considered similar (cosine similarity approx of 1).
  • When opposing (so pointing in opposite directions), they differ (cosine similarity ≈ -1).
  • If they’re perpendicular (at a right angle of 90° to one another), they are unrelated (cosine similarity close to 0).

This normalizing property ensures that the similarity score correctly measures alignment, regardless of how one vector is scaled in comparison to another.

Example: Returning to our streaming preferences, let’s take a look at how our dinner guest’s preferences would look like as vectors:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

Let us discuss why cosine similarity is really effective in this case. So, when we compute cosine similarity for vec1 [5, 10, 5] and vec2 [1, 2, 1], we’re essentially trying to see the angle between these vectors.

Cosine similarity normalizes the vectors first, dividing each component by the length of its vector. This operation “cancels” the differences in magnitude:

  • So for vec1: Normalization gives us approximately [0.41, 0.82, 0.41].
  • For vec2: Normalization also gives us approximately [0.41, 0.82, 0.41].

And now we also understand why these vectors would be considered identical with regard to cosine similarity because their normalized versions are identical!

This tells us that even though dinner guest A views more total content, the proportion they allocate to any given genre perfectly mirrors dinner guest B’s preferences. It’s like saying both your guests dedicate 20% of their time to action, 60% to drama, and 20% to comedy, no matter the total hours viewed.
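A quick numerical check of this normalization argument with NumPy (the arrays are the same illustrative preference vectors as above):

import numpy as np

vec1 = np.array([5, 10, 5])
vec2 = np.array([1, 2, 1])

# Scale each vector to unit length; only its direction remains
unit1 = vec1 / np.linalg.norm(vec1)
unit2 = vec2 / np.linalg.norm(vec2)

print(unit1)                 # approx. [0.41 0.82 0.41]
print(unit2)                 # approx. [0.41 0.82 0.41]
print(np.dot(unit1, unit2))  # approx. 1.0 -> maximal cosine similarity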

It’s this normalization that makes cosine similarity particularly effective for high-dimensional data such as text embeddings or user preferences.

When dealing with data of many dimensions (think hundreds or thousands of components of a vector for various features of a movie), it is often the relative significance of each dimension corresponding to the complete profile rather than the absolute values that matter most. Cosine similarity identifies precisely this arrangement of relative importance and is a powerful tool to identify meaningful relationships in complex data.


Hiking up the Euclidian Mountain Trail

Photo by Christian Mikhael on Unsplash

In this part, we will see how different approaches to measuring similarity behave in practice, with a concrete example from the real world and some little code example. Even if you are a non-techie, the code will be easy to understand for you as well. It’s to illustrate the simplicity of it all. No fear!

How about we quickly discuss a 10-mile-long hiking trail? Two friends, Alex and Blake, write trail reviews of the same hike, but each ascribes it a different character:

The trail gained 2,000 feet in elevation over just 2 miles! Easily doable with some high spikes in between!
Alex

and

Beware, we hiked 100 straight feet up in the forest terrain at the spike! Overall, 10 beautiful miles of forest!
Blake

These descriptions can be represented as vectors:

alex_description = [2000, 2]  # [elevation_gain, trail_distance]
blake_description = [100, 10]  # [elevation_gain, trail_distance]

Let’s combine both similarity measures and see what it tells us:

import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Measures how similar the pattern or shape of two descriptions is,
    ignoring differences in scale. Returns 1.0 for perfectly aligned patterns.
    """
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

def euclidean_distance(vec1, vec2):
    """
    Measures the direct 'as-the-crow-flies' difference between descriptions.
    Smaller numbers mean descriptions are more similar.
    """
    return np.linalg.norm(np.array(vec1) - np.array(vec2))

# Alex focuses on the steep part: 2000ft elevation over 2 miles
alex_description = [2000, 2]  # [elevation_gain, trail_distance]

# Blake describes the whole trail: 100ft average elevation per mile over 10 miles
blake_description = [100, 10]  # [elevation_gain, trail_distance]

# Let's see how different these descriptions appear using each measure
print("Comparing how Alex and Blake described the same trail:")
print("\nEuclidean distance:", euclidean_distance(alex_description, blake_description))
print("(A larger number here suggests very different descriptions)")

print("\nCosine similarity:", cosine_similarity(alex_description, blake_description))
print("(A number close to 1.0 suggests similar patterns)")

# Let's also normalize the vectors to see what cosine similarity is looking at
alex_normalized = alex_description / np.linalg.norm(alex_description)
blake_normalized = blake_description / np.linalg.norm(blake_description)

print("\nAlex's normalized description:", alex_normalized)
print("Blake's normalized description:", blake_normalized)

So now, running this code, something magical happens:

Comparing how Alex and Blake described the same trail:

Euclidean distance: 1900.0168
(A larger number here suggests very different descriptions)

Cosine similarity: 0.9951
(A number close to 1.0 suggests similar patterns)

Alex's normalized description: [9.99999500e-01 9.99999500e-04]
Blake's normalized description: [0.99503719 0.09950372]

This output shows why, depending on what you are measuring, the same trail may appear different or similar.

The large Euclidean distance (roughly 1,900) suggests these are very different descriptions. That's understandable: 2,000 is a lot different from 100, and 2 is a lot different from 10. Euclidean distance takes the raw difference between these numbers without understanding their meaning.

But the high cosine similarity (about 0.995) tells us something more interesting: both descriptions capture a similar pattern.

If we look at the normalized vectors, we can see it too: both Alex and Blake are describing a trail in which elevation gain is the dominant feature. In each normalized vector, the first number (elevation gain) is much larger relative to the second (trail distance). Normalization strips away the difference in scale and keeps only the proportions, and both descriptions share the same proportions because they describe the same defining trait of the trail.

Perfectly true to life: Alex and Blake hiked the same trail but focused on different parts of it when writing their reviews. Alex zoomed in on the steep section, 2,000 feet of climbing packed into 2 miles, while Blake summarized the entire 10-mile route with its single 100-foot steep pitch. Cosine similarity identifies these descriptions as variations of the same basic trail pattern, whereas Euclidean distance regards them as completely different trails.

This example highlights the need to select the appropriate similarity measure. In real use cases, normalizing and comparing with cosine similarity surfaces meaningful relationships that raw Euclidean distances would miss.


Real-World Impacts of Metric Choices

Photo by fabio on Unsplash

The metric you pick doesn’t merely change the numbers; it influences the results of complex systems. Here’s how it breaks down in various domains:

  • In Recommendation Engines: When it comes to cosine similarity, we can group users who have the same tastes, even if they are doing different amounts of overall activity. A streaming service could use this to recommend movies that align with a user’s genre preferences, regardless of what is popular among a small subset of very active viewers.
  • In Document Retrieval: When querying a database of documents or research papers, cosine similarity ranks documents according to whether their content is similar in meaning to the user’s query, rather than their text length. This enables systems to retrieve results that are contextually relevant to the query, even though the documents are of a wide range of sizes.
  • In Fraud Detection: Patterns of behavior are often more important than pure numbers. Cosine similarity can be used to detect anomalies in spending habits, as it compares the direction of the transaction vectors — type of merchant, time of day, transaction amount, etc. — rather than the absolute magnitude.

And these differences matter because they give a sense of how systems “think”. Let’s get back to that credit card example one more time: a system based on Euclidean distance might flag a high-value $7,000 transaction for your new e-bike as suspicious, even though that purchase is perfectly normal for you, given your average spend of $20,000 a month.

A cosine-based system, on the other hand, understands that the transaction is consistent with what the user typically spends their money on, thus avoiding unnecessary false notifications.
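A tiny sketch makes this concrete. The category mix and amounts below are made up for illustration; this is not how a production fraud system would be built, but it shows why the two metrics react so differently to the same e-bike purchase:

import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Hypothetical monthly spend per category: [electronics & travel, living costs, dining]
typical_month = np.array([12_000.0, 5_000.0, 3_000.0])  # ~$20,000 in a usual month
this_month    = np.array([19_000.0, 5_200.0, 2_800.0])  # includes the $7,000 e-bike

print(np.linalg.norm(typical_month - this_month))    # ~7006: huge raw difference
print(cosine_similarity(typical_month, this_month))  # ~0.99: same spending pattern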

But measures like Euclidean distance and cosine similarity are not merely theoretical. They’re the blueprints on which real-world systems stand. Whether it’s recommendation engines or fraud detection, the metrics we choose will directly impact how systems make sense of relationships in data.

Vector Representations in Practice: Industry Transformations

Photo by Louis Reed on Unsplash

This ability for abstraction is what makes vector representations so powerful — they transform complex and abstract field data into concepts that can be scored and actioned. These insights are catalyzing fundamental transformations in business processes, decision-making, and customer value delivery across sectors.

Next, let’s look at a concrete use case to see how vectors free up time for solving big problems and create new, high-impact opportunities. I picked a single industry to show what a vector-based approach to a challenge can achieve: a healthcare example from a clinical setting. Why? Because it matters to us all and is easier to relate to than digging into the depths of finance, insurance, renewable energy, or chemistry.

Healthcare Spotlight: Pattern Recognition in Complex Medical Data

The healthcare industry poses a perfect storm of challenges that vector representations can uniquely solve. Think of the complexities of patient data: medical histories, genetic information, lifestyle factors, and treatment outcomes all interact in nuanced ways that traditional rule-based systems are incapable of capturing.

At Massachusetts General Hospital, researchers implemented a vector-based early detection system for sepsis, a condition in which every hour of early detection increases the chances of survival by 7.6% (see the full study at pmc.ncbi.nlm.nih.gov/articles/PMC6166236/).

In this new methodology, spontaneous neutrophil velocity profiles (SVP) are used to describe the movement patterns of neutrophils from a drop of blood. We won’t get too medically detailed here, because we’re vector-focused today, but a neutrophil is an immune cell that acts as a kind of first responder when the body fights off infections.

The system then encodes each neutrophil’s motion as a vector that captures not just its magnitude (i.e., speed) but also its direction. By converting these biological patterns into high-dimensional vector spaces, the researchers could capture subtle differences and show that healthy individuals and sepsis patients exhibit statistically significant differences in movement. These numeric vectors were then processed by a machine learning model trained to detect early signs of sepsis. The result was a diagnostic tool with impressive sensitivity (97%) and specificity (98%), enabling rapid and accurate identification of this potentially fatal condition, probably using the cosine similarity we learned about a moment ago (the paper doesn’t go into much detail, so this is pure speculation, but it would be the most suitable measure).
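To illustrate just the vector idea (not the study's actual pipeline), encoding motion as magnitude plus direction can be as simple as this sketch with made-up coordinates:

import numpy as np

# Made-up positions of one cell in two consecutive frames (arbitrary units)
p_start = np.array([12.0, 3.0])
p_end = np.array([15.0, 7.0])
dt = 1.0  # time between frames

velocity = (p_end - p_start) / dt   # motion encoded as a vector
speed = np.linalg.norm(velocity)    # magnitude: how fast the cell moves
direction = velocity / speed        # unit vector: which way it moves

print(speed)      # 5.0
print(direction)  # [0.6 0.8]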

This is just one example of how medical data can be encoded into vector representations and turned into malleable, actionable insights. This approach made it possible to re-contextualize complex relationships and, combined with machine learning, worked around the limitations of previous diagnostic modalities, proving to be a potent tool for clinicians to save lives. It’s a powerful reminder that vectors aren’t merely theoretical constructs; they’re practical, life-saving solutions that are powering the future of healthcare as much as your credit card risk detection software, and hopefully also your business.


Lead and understand, or face disruption. The naked truth.

Photo by Hunters Race on Unsplash

With everything you have read so far, consider a decision as seemingly small as the choice of metric under which data relationships are evaluated. Leaders risk making assumptions that are subtle yet disastrous. Making leadership decisions without understanding the fundamentals of vectors is like using a calculator without knowing which formulas you are punching in: you get a result, but you cannot tell whether it is right.

The good news is this doesn’t mean that business leaders have to become data scientists. Vectors are delightful because, once the core ideas have been grasped, they become very easy to work with. An understanding of a handful of concepts (for example, how vectors encode relationships, why distance metrics are important, and how embedding models function) can fundamentally change how you make high-level decisions. These tools will help you ask better questions, work with technical teams more effectively, and make sound decisions about the systems that will govern your business.

The returns on this small investment in comprehension are huge. There is much talk about personalization, yet few organizations use vector-based thinking in their business strategies. It could help them leverage personalization to its full potential, delighting customers with tailored experiences and building loyalty. You could innovate in areas like fraud detection and operational efficiency, leveraging subtle patterns in data that traditional approaches miss, or perhaps even save lives, as described above. Equally important, you can avoid the expensive missteps that happen when leaders defer key decisions to others without understanding what those decisions mean.

The truth is, vectors are here now, quietly driving most of the hyped AI technology behind the scenes and helping to create the world we navigate today and tomorrow. Companies that do not adapt their leadership to think in vectors risk falling behind in a competitive landscape that is becoming ever more data-driven. Those who adopt this new paradigm will not just survive but prosper in an age of never-ending AI innovation.

Now is the moment to act. Start to view the world through vectors: study their language, examine the principles behind them, and ask how this new way of thinking could change your tactics and your guiding metrics. Much in the way that algebra became an essential tool for working through practical life challenges, vectors will soon serve as the literacy of the data age; in fact, they already do. It is a future the powerful know how to take control of. The question is not whether vectors will define the next era of business; it is whether you are prepared to lead it.

The post The Invisible Revolution: How Vectors Are (Re)defining Business Success appeared first on Towards Data Science.

]]>
Time Series Forecasting Made Simple (Part 1): Decomposition and Baseline Models https://towardsdatascience.com/time-series-forecasting-made-simple-part-1-decomposition-baseline-models/ Wed, 09 Apr 2025 19:53:52 +0000 https://towardsdatascience.com/?p=605699 Learn the intuition behind time series decomposition, additive vs. multiplicative models and build your first forecasting baseline model using Python

The post Time Series Forecasting Made Simple (Part 1): Decomposition and Baseline Models appeared first on Towards Data Science.

]]>
I used to avoid time series analysis. Every time I took an online course, I’d see a module titled “Time Series Analysis” with subtopics like Fourier Transforms, autocorrelation functions and other intimidating terms. I don’t know why, but I always found a reason to avoid it.

But here’s what I’ve learned: any complex topic becomes manageable when we start from the basics and focus on understanding the intuition behind it. That’s exactly what this blog series is about: making time series feel less like a maze and more like a conversation with your data over time.

We understand complex topics much more easily when they’re explained through real-world examples. That’s exactly how I’ll approach this series.

In each post, we’ll work with a simple dataset and explore what’s needed from a time series perspective. We’ll build intuition around each concept, understand why it matters, and implement it step by step on the data.

Time Series Analysis is the process of understanding, modeling and Forecasting data that is observed over time. It involves identifying patterns such as trends, seasonality and noise using past observations to make informed predictions about future values.

Let’s start by considering a dataset named Daily Minimum Temperatures in Melbourne (open license). This dataset contains daily records of the lowest temperature (in Celsius) observed in Melbourne, Australia, over a 10-year period from 1981 to 1990. Each entry includes just two columns:

Date: The calendar day (from 1981-01-01 to 1990-12-31)
Temp: The minimum temperature recorded on that day

You’ve probably heard of models like ARIMA, SARIMA or Exponential Smoothing. But before we go there, it’s a good idea to try out some simple baseline models first, to see how well a basic approach performs on our data.

While there are many types of baseline models used in time series forecasting, here we’ll focus on the three most essential ones, which are simple, effective, and widely applicable across industries.

Naive Forecast: Assumes the next value will be the same as the last observed one.
Seasonal Naive Forecast: Assumes the value will repeat from the same point last season (e.g., last week or last month).
Moving Average: Takes the average of the last n points.
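To make these three baselines concrete before we touch the temperature data, here is a minimal pandas sketch on a small made-up monthly series (the numbers and the 4-month "season" are invented purely for illustration):

import pandas as pd

# A small made-up monthly series with a repeating 4-month pattern
sales = pd.Series(
    [100, 120, 130, 90, 110, 125, 135, 95, 115, 130, 140, 100],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

naive = sales.shift(1)                             # last observed value
seasonal_naive = sales.shift(4)                    # same point one season (4 months) ago
moving_average = sales.rolling(3).mean().shift(1)  # average of the previous 3 points

print(pd.DataFrame({
    "actual": sales,
    "naive": naive,
    "seasonal_naive": seasonal_naive,
    "moving_average_3": moving_average,
}))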

You might be wondering, why use baseline models at all? Why not just go straight to the well-known forecasting methods like ARIMA or SARIMA?

Let’s consider a shop owner who wants to forecast next month’s sales. By applying a moving average baseline model, they can estimate next month’s sales as the average of previous months. This simple approach might already deliver around 80% accuracy — good enough for planning and inventory decisions.

Now, if we switch to a more advanced model like ARIMA or SARIMA, we might increase accuracy to around 85%. But the key question is: is that extra 5% worth the additional time, effort and resources? In this case, the baseline model does the job.

In fact, in most everyday business scenarios, baseline models are sufficient. We typically turn to classical models like ARIMA or SARIMA in high-impact industries such as finance or energy, where even a small improvement in accuracy can have a significant financial or operational impact. Even then, a baseline model is usually applied first — not only to provide quick insights but also to act as a benchmark that more complex models must outperform.

Okay, now that we’re ready to implement some baseline models, there’s one key thing we need to understand first:
Every time series is made up of three main components — trend, seasonality and residuals.

Time series decomposition separates data into trend, seasonality and residuals (noise), helping us uncover the true patterns beneath the surface. This understanding guides the choice of forecasting models and improves accuracy. It’s also a vital first step before building both simple and advanced forecasting solutions.

Trend
This is the overall direction your data is moving in over time — going up, down or staying flat.
Example: Steady decrease in monthly cigarette sales.

Seasonality
These are the patterns that repeat at regular intervals — daily, weekly, monthly or yearly.
Example: Cool drinks sales in summer.

Residuals (Noise)
This is the random “leftover” part of the data, the unpredictable ups and downs that can’t be explained by trend or seasonality.
Example: A one-time car purchase showing up in your monthly expense pattern.

Now that we understand the key components of a time series, let’s put that into practice using a real dataset: Daily Minimum Temperatures in Melbourne, Australia.

We’ll use Python to decompose the time series into its trend, seasonality, and residual components so we can better understand its structure and choose an appropriate baseline model.

Code:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load the dataset
df = pd.read_csv("minimum daily temperatures data.csv")

# Convert 'Date' to datetime and set as index
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)

# Set a regular daily frequency and fill missing values using forward fill
df = df.asfreq('D')
df['Temp'].fillna(method='ffill', inplace=True)

# Decompose the daily series (365-day seasonality for yearly patterns)
decomposition = seasonal_decompose(df['Temp'], model='additive', period=365)

# Plot the decomposed components
decomposition.plot()
plt.suptitle('Decomposition of Daily Minimum Temperatures (Daily)', fontsize=14)
plt.tight_layout()
plt.show()

Output:

Decomposition of daily temperatures showing trend, seasonal cycles and random fluctuations.

The decomposition plot clearly shows a strong seasonal pattern that repeats each year, along with a mild trend that shifts over time. The residual component captures the random noise that isn’t explained by trend or seasonality.

In the code earlier, you might have noticed that I used an additive model for decomposing the Time Series. But what exactly does that mean — and why is it the right choice for this dataset?

Let’s break it down.
In an additive model, we assume the trend, seasonality and residuals (noise) combine linearly, like this:

\[ Y_t = T_t + S_t + R_t \]

Where:
Y is the actual value at time t
T is the trend
S is the seasonal component
R is the residual (random noise)

This means we’re treating the observed value as the sum of the parts: each component contributes independently to the final output.
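We can check this directly on the decomposition we just computed. Assuming the `decomposition` object and `df` from the earlier snippet, adding the three components back together reproduces the original series (except at the edges, where the centred moving average leaves NaNs):

import pandas as pd

# Assumes `decomposition` and `df` from the earlier snippet
reconstructed = decomposition.trend + decomposition.seasonal + decomposition.resid

comparison = pd.DataFrame({"actual": df["Temp"], "reconstructed": reconstructed}).dropna()
print(comparison.head())
print("Max absolute difference:", (comparison["actual"] - comparison["reconstructed"]).abs().max())
# Should be essentially zero for an additive decomposition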

I chose the additive model because when I looked at the pattern in daily minimum temperatures, I noticed something important:

The line plot above shows the daily minimum temperatures from 1981 to 1990. We can clearly see a strong seasonal cycle that repeats each year, colder temperatures in winter, warmer in summer.

Importantly, the amplitude of these seasonal swings stays relatively consistent over the years. For example, the temperature difference between summer and winter doesn’t appear to grow or shrink over time. This stability in seasonal variation is a key sign that the additive model is appropriate for decomposition, since the seasonal component appears to be independent of any trend.

We use an additive model when the trend is relatively stable and does not amplify or distort the seasonal pattern, and when the seasonality stays within a consistent range over time, even if there are minor fluctuations.

Now that we understand how the additive model works, let’s explore the multiplicative model, which is often used when the seasonal effect scales with the trend. Looking at it will also help us understand the additive model more clearly.

Consider a household’s electricity consumption. Suppose the household uses 20% more electricity in summer compared to winter. That means the seasonal effect isn’t a fixed number — it’s a proportion of their baseline usage.

Let’s see how this looks with real numbers:

In 2021, the household used 300 kWh in winter and 360 kWh in summer (20% more than winter).

In 2022, their winter consumption increased to 330 kWh, and summer usage rose to 396 kWh (still 20% more than winter).

In both years, the seasonal difference grows with the trend, from +60 kWh in 2021 to +66 kWh in 2022, even though the percentage increase stays the same. This is exactly the kind of behavior that a multiplicative model captures well.

In mathematical terms:

\[ Y_t = T_t \times S_t \times R_t \]

Where:
Y is the observed value
T is the trend component
S is the seasonal component
R is the residual (noise)
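As a quick sketch, the electricity example can be written as a trend multiplied by a fixed seasonal factor, and with statsmodels the only change from the earlier decomposition snippet would be the `model` argument:

# The household electricity example as trend x seasonal factor
winter_2021, summer_factor = 300, 1.20
print(winter_2021 * summer_factor)  # 360 kWh in summer 2021

winter_2022 = 330
print(winter_2022 * summer_factor)  # 396 kWh in summer 2022

# For a strictly positive series `y` with a known seasonal period, a multiplicative
# decomposition would look like:
# seasonal_decompose(y, model='multiplicative', period=12)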

By looking at the decomposition plot, we can figure out whether an additive or multiplicative model fits our data better.

There are also other powerful decomposition tools available, which I’ll be covering in one of my upcoming blog posts.

Now that we have a clear understanding of additive and multiplicative models, let’s shift our focus to applying a baseline model that fits this dataset.

Based on the decomposition plot, we can see a strong seasonal pattern in the data, which suggests that a Seasonal Naive model might be a good fit for this time series.

This model assumes that the value at a given time will be the same as it was in the same period of the previous season — making it a simple yet effective choice when seasonality is dominant and consistent. For example, if temperatures typically follow the same yearly cycle, then the forecast for July 1st, 1990, would simply be the temperature recorded on July 1st, 1989.

Code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
df = pd.read_csv("minimum daily temperatures data.csv")

# Convert 'Date' column to datetime and set as index
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)

# Ensure regular daily frequency and fill missing values
df = df.asfreq('D')
df['Temp'].fillna(method='ffill', inplace=True)

# Step 1: Create the Seasonal Naive Forecast
seasonal_period = 365  # Assuming yearly seasonality for daily data
# Create the Seasonal Naive forecast by shifting the temperature values by 365 days
df['Seasonal_Naive'] = df['Temp'].shift(seasonal_period)

# Step 2: Plot the actual vs forecasted values
# Plot the last 2 years (730 days) of data to compare
plt.figure(figsize=(12, 5))
plt.plot(df['Temp'][-730:], label='Actual')
plt.plot(df['Seasonal_Naive'][-730:], label='Seasonal Naive Forecast', linestyle='--')
plt.title('Seasonal Naive Forecast vs Actual Temperatures')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.tight_layout()
plt.show()

# Step 3: Evaluate using MAPE (Mean Absolute Percentage Error)
# Use the last 365 days for testing
test = df[['Temp', 'Seasonal_Naive']].iloc[-365:].copy()
test.dropna(inplace=True)

# MAPE Calculation
mape = np.mean(np.abs((test['Temp'] - test['Seasonal_Naive']) / test['Temp'])) * 100
print(f"MAPE (Seasonal Naive Forecast): {mape:.2f}%")

Output:

Seasonal Naive Forecast vs. Actual Temperatures (1989–1990)


To keep the visualization clear and focused, we’ve plotted the last two years of the dataset (1989–1990) instead of all 10 years.

This plot compares the actual daily minimum temperatures in Melbourne with the values predicted by the Seasonal Naive model, which simply assumes that each day’s temperature will be the same as it was on the same day one year ago.

As seen in the plot, the Seasonal Naive forecast captures the broad shape of the seasonal cycles quite well — it mirrors the rise and fall of temperatures throughout the year. However, it doesn’t capture day-to-day variations, nor does it respond to slight shifts in seasonal timing. This is expected, as the model is designed to repeat the previous year’s pattern exactly, without adjusting for trend or noise.

To evaluate how well this model performs, we calculate the Mean Absolute Percentage Error (MAPE) over the final 365 days of the dataset (i.e., 1990). We only use this period because the Seasonal Naive forecast needs a full year of historical data before it can begin making predictions.

Mean Absolute Percentage Error (MAPE) is a commonly used metric to evaluate the accuracy of forecasting models. It measures the average absolute difference between the actual and predicted values, expressed as a percentage of the actual values.
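In formula form, with actual values y, forecasts ŷ and n evaluation points, this is exactly what the np.mean(np.abs(...)) line in the code above computes:

\[ \textbf{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \]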

In time series forecasting, we typically evaluate model performance on the most recent or target time period — not on the middle years. This reflects how forecasts are used in the real world: we build models on historical data to predict what’s coming next.

That’s why we calculate MAPE only on the final 365 days of the dataset — this simulates forecasting for a future and gives us a realistic measure of how well the model would perform in practice.

This gives us a MAPE of 28.23%, a baseline level of forecasting error. Any model we build next, whether customized or more advanced, should aim to outperform this benchmark.

A MAPE of 28.23% means that, on average, the model’s predictions were 28.23% off from the actual daily temperature values over the last year.

In other words, if the true temperature on a given day was 10°C, the Seasonal Naïve forecast might have been around 7.2°C or 12.8°C, reflecting a 28% deviation.

I’ll dive deeper into evaluation metrics in a future post.

In this post, we laid the foundation for time series forecasting by understanding how real-world data can be broken down into trend, seasonality, and residuals through decomposition. We explored the difference between additive and multiplicative models, implemented the Seasonal Naive baseline forecast and evaluated its performance using MAPE.

While the Seasonal Naive model is simple and intuitive, it comes with limitations, especially for this dataset. It assumes that the temperature on any given day is identical to the same day last year. But as the plot and the MAPE of 28.23% showed, this assumption doesn’t hold perfectly. The data displays slight shifts in seasonal patterns and long-term variations that the model fails to capture.

In the next part of this series, we’ll go further. We’ll explore how to customize a baseline model, compare it to the Seasonal Naive approach and evaluate which one performs better using error metrics like MAPE, MAE and RMSE.

We’ll also begin building the foundation needed to understand more advanced models like ARIMA including key concepts such as:

  • Stationarity
  • Autocorrelation and Partial Autocorrelation 
  • Differencing
  • Lag-based modeling (AR and MA terms)

Part 2 will dive into these topics in more detail, starting with custom baselines and ending with the foundations of ARIMA.

Thanks for reading.  I hope you found this post helpful and insightful.

The post Time Series Forecasting Made Simple (Part 1): Decomposition and Baseline Models appeared first on Towards Data Science.

]]>
Mining Rules from Data https://towardsdatascience.com/mining-rules-from-data/ Wed, 09 Apr 2025 16:54:40 +0000 https://towardsdatascience.com/?p=605697 Using decision trees for quick segmentation

The post Mining Rules from Data appeared first on Towards Data Science.

]]>
Working with products, we might face a need to introduce some “rules”. Let me explain what I mean by “rules” in practical examples: 

  • Imagine that we’re seeing a massive wave of fraud in our product, and we want to restrict onboarding for a particular segment of customers to lower this risk. For example, we found out that the majority of fraudsters had specific user agents and IP addresses from certain countries. 
  • Another option is to send coupons to customers to use in our online shop. However, we would like to treat only customers who are likely to churn since loyal users will return to the product anyway. We might figure out that the most feasible group is customers who joined less than a year ago and decreased their spending by 30%+ last month. 
  • Transactional businesses often have a segment of customers on which they are losing money. For example, a bank customer passed the verification and regularly reached out to customer support (so they generated onboarding and servicing costs) while doing almost no transactions (so not generating any revenue). The bank might introduce a small monthly subscription fee for customers with less than $1,000 in their account, since they are likely non-profitable.

Of course, in all these cases, we might have used a complex Machine Learning model that would take into account all the factors and predict the probability (either of a customer being a fraudster or churning). Still, under some circumstances, we might prefer just a set of static rules for the following reasons:  

  • The speed and complexity of implementation. Deploying an ML model in production takes time and effort. If you are experiencing a fraud wave right now, it might be more feasible to go live with a set of static rules that can be implemented quickly and then work on a comprehensive solution. 
  • Interpretability. ML models are black boxes. Even though we might be able to understand at a high level how they work and what features are the most important ones, it’s challenging to explain them to customers. In the example of subscription fees for non-profitable customers, it’s important to share a set of transparent rules with customers so that they can understand the pricing. 
  • Compliance. Some industries, like finance or healthcare, might require auditable and rule-based decisions to meet compliance requirements.

In this article, I want to show you how we can solve business problems using such rules. We will take a practical example and go really deep into this topic:

  • we will discuss which models we can use to mine such rules from data,
  • we will build a Decision Tree Classifier from scratch to learn how it works,
  • we will fit the sklearn Decision Tree Classifier model to extract the rules from the data,
  • we will learn how to parse the Decision Tree structure to get the resulting segments,
  • finally, we will explore different options for category encoding, since the sklearn implementation doesn’t support categorical variables.

We have lots of topics to cover, so let’s jump into it.

Case

As usual, it’s easier to learn something with a practical example. So, let’s start by discussing the task we will be solving in this article. 

We will work with the Bank Marketing dataset (CC BY 4.0 license). This dataset contains data about the direct marketing campaigns of a Portuguese banking institution. For each customer, we know a bunch of features and whether they subscribed to a term deposit (our target). 

Our business goal is to maximise the number of conversions (subscriptions) with limited operational resources. So, we can’t call the whole user base, and we want to reach the best outcome with the resources we have.

The first step is to look at the data. So, let’s load the data set.

import pandas as pd
pd.set_option('display.max_colwidth', 5000)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('bank-full.csv', sep = ';')
df = df.drop(['duration', 'campaign'], axis = 1)
# removed columns related to the current marketing campaign, 
# since they introduce data leakage

df.head()

We know quite a lot about the customers, including personal data (such as job type or marital status) and their previous behaviour (such as whether they have a loan or their average yearly balance).

Image by author

The next step is to select a machine-learning model. There are two classes of models that are usually used when we need something easily interpretable:

  • decision trees,
  • linear or logistic regression.

Both options are feasible and can give us good models that can be easily implemented and interpreted. However, in this article, I would like to stick to the decision tree model because it produces actual rules, while logistic regression will give us probability as a weighted sum of features.

Data Preprocessing 

As we’ve seen in the data, there are lots of categorical variables (such as education or marital status). Unfortunately, the sklearn decision tree implementation can’t handle categorical data, so we need to do some preprocessing.

Let’s start by transforming yes/no flags into integers. 

for p in ['default', 'housing', 'loan', 'y']:
    df[p] = df[p].map(lambda x: 1 if x == 'yes' else 0)

The next step is to transform the month variable. We can use one-hot encoding for months, introducing flags like month_jan , month_feb , etc. However, there might be seasonal effects, and I think it would be more reasonable to convert months into integers following their order. 

month_map = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
# I saved 5 mins by asking ChatGPT to do this mapping

df['month'] = df.month.map(lambda x: month_map[x] if x in month_map else x)

For all other categorical variables, let’s use one-hot encoding. We will discuss different strategies for category encoding later, but for now, let’s stick to the default approach.

The easiest way to do one-hot encoding is to leverage get_dummies function in pandas.

fin_df = pd.get_dummies(
  df, columns=['job', 'marital', 'education', 'poutcome', 'contact'], 
  dtype = int, # to convert to flags 0/1
  drop_first = False # to keep all possible values
)

This function transforms each categorical variable into a separate 1/0 column for each possible value. We can see how it works for the poutcome column.

fin_df.merge(df[['id', 'poutcome']])\
    .groupby(['poutcome', 'poutcome_unknown', 'poutcome_failure', 
      'poutcome_other', 'poutcome_success'], as_index = False).y.count()\
    .rename(columns = {'y': 'cases'})\
    .sort_values('cases', ascending = False)
Image by author

Our data is now ready, and it’s time to discuss how decision tree classifiers work.

Decision Tree Classifier: Theory

In this section, we’ll explore the theory behind the Decision Tree Classifier and build the algorithm from scratch. If you’re more interested in a practical example, feel free to skip ahead to the next part.

The easiest way to understand the decision tree model is to look at an example. So, let’s build a simple model based on our data. We will use DecisionTreeClassifier from sklearn.

import sklearn.tree

feature_names = fin_df.drop(['y'], axis = 1).columns
model = sklearn.tree.DecisionTreeClassifier(
  max_depth = 2, min_samples_leaf = 1000)
model.fit(fin_df[feature_names], fin_df['y'])

The next step is to visualise the tree.

import graphviz

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True, 
    proportion = True, precision = 2 
    # to show shares of classes instead of absolute numbers
)

graph = graphviz.Source(dot_data)
graph
Image by author

So, we can see that the model is straightforward. It’s a set of binary splits that we can use as heuristics. 

Let’s figure out how the classifier works under the hood. As usual, the best way to understand the model is to build the logic from scratch. 

The cornerstone of the algorithm is the optimisation function. By default, the decision tree classifier optimises the Gini coefficient. Imagine drawing one random item from the sample and then another (with replacement). The Gini coefficient equals the probability that these two items come from different classes. So, our goal will be to minimise the Gini coefficient.

In the case of just two classes (like in our example, where marketing intervention was either successful or not), the Gini coefficient is defined just by one parameter p , where p is the probability of getting an item from one of the classes. Here’s the formula:

\[\textbf{gini}(\textsf{p}) = 1 – \textsf{p}^2 – (1 – \textsf{p})^2 = 2 * \textsf{p} * (1 – \textsf{p}) \]

If our classification is ideal and we are able to separate the classes perfectly, then the Gini coefficient will be equal to 0. The worst-case scenario is p = 0.5, where the Gini coefficient reaches its maximum of 0.5.
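To make this concrete with our data: roughly 11.7% of customers in the dataset converted (the share we will see at the root node), so

\[ \textbf{gini}(0.117) = 2 * 0.117 * (1 - 0.117) \approx 0.207 \]

which is in line with the 0.2065 we will compute from the raw data below.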

With the formula above, we can calculate the Gini coefficient for each leaf of the tree. To calculate the Gini coefficient for the whole tree, we need to combine the Gini coefficients of binary splits. For that, we can just get a weighted sum:

\[\textbf{gini}_{\textsf{total}} = \textbf{gini}_{\textsf{left}} * \frac{\textbf{n}_{\textsf{left}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}} + \textbf{gini}_{\textsf{right}} * \frac{\textbf{n}_{\textsf{right}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}}\]

Now that we know what value we’re optimising, we only need to define all possible binary splits, iterate through them and choose the best option. 

Defining all possible binary splits is also quite straightforward. We can do it one parameter at a time: sort the possible values and pick thresholds between consecutive ones. For example, for month (an integer from 1 to 12).

Image by author

Let’s try to code it and see whether we will come to the same result. First, we will define functions that calculate the Gini coefficient for one dataset and the combination.

def get_gini(df):
    p = df.y.mean()
    return 2*p*(1-p)

print(get_gini(fin_df)) 
# 0.2065
# close to what we see at the root node of Decision Tree

def get_gini_comb(df1, df2):
    n1 = df1.shape[0]
    n2 = df2.shape[0]

    gini1 = get_gini(df1)
    gini2 = get_gini(df2)
    return (gini1*n1 + gini2*n2)/(n1 + n2)

The next step is to get all possible thresholds for one parameter and calculate their Gini coefficients. 

import tqdm
def optimise_one_parameter(df, param):
    tmp = []
    possible_values = list(sorted(df[param].unique()))
    print(param)

    for i in tqdm.tqdm(range(1, len(possible_values))): 
        threshold = (possible_values[i-1] + possible_values[i])/2
        gini = get_gini_comb(df[df[param] <= threshold], 
          df[df[param] > threshold])
        tmp.append(
            {'param': param, 
            'threshold': threshold, 
            'gini': gini, 
            'sizes': (df[df[param] <= threshold].shape[0], df[df[param] > threshold].shape[0])
            }
        )
    return pd.DataFrame(tmp)

The final step is to iterate through all features and calculate all possible splits. 

tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(fin_df, feature))
opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)
Image by author

Wonderful, we’ve got the same result as in our DecisionTreeClassifier model. The optimal split is whether poutcome = success or not. We’ve reduced the Gini coefficient from 0.2065 to 0.1872. 

To continue building the tree, we need to repeat the process recursively. For example, going down for the poutcome_success <= 0.5 branch:

tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(
      fin_df[fin_df.poutcome_success <= 0.5], feature))

opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)
Image by author

The only question we still need to discuss is the stopping criteria. In our initial example, we’ve used two conditions:

  • max_depth = 2 — it just limits the maximum depth of the tree, 
  • min_samples_leaf = 1000 prevents us from getting leaf nodes with less than 1K samples. Because of this condition, we’ve chosen a binary split by contact_unknown even though age led to a lower Gini coefficient.

Also, I usually set min_impurity_decrease, which prevents us from going further if the gains are too small. By gains, we mean the decrease of the Gini coefficient.

So, we’ve understood how the Decision Tree Classifier works, and now it’s time to use it in practice.

If you’re interested to see how Decision Tree Regressor works in all detail, you can look it up in my previous article.

Decision Trees: practice

We’ve already built a simple tree model with two layers, but it’s definitely not enough since it’s too simple to get all the insights from the data. Let’s train another decision tree, this time only limiting the minimum number of samples in leaves and the minimum impurity decrease (i.e., the reduction of the Gini coefficient).

model = sklearn.tree.DecisionTreeClassifier(
  min_samples_leaf = 1000, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True, 
    proportion = True, precision=2, impurity = True)

graph = graphviz.Source(dot_data)

# saving graph to png file
png_bytes = graph.pipe(format='png')
with open('decision_tree.png','wb') as f:
    f.write(png_bytes)
Image by author

That’s it. We’ve got our rules to split customers into groups (leaves). Now, we can iterate through groups and see which groups of customers we want to contact. Even though our model is relatively small, it’s daunting to copy all conditions from the image. Luckily, we can parse the tree structure and get all the groups from the model.

The Decision Tree classifier has an attribute tree_ that will allow us to get access to low-level attributes of the tree, such as node_count .

n_nodes = model.tree_.node_count
print(n_nodes)
# 13

The tree_ variable also stores the entire tree structure as parallel arrays, where the ith element of each array stores the information about node i. For the root, i equals 0.

Here are the arrays we have to represent the tree structure: 

  • children_left and children_right — IDs of left and right nodes, respectively; if the node is a leaf, then -1.
  • feature — feature used to split the node i .
  • threshold — threshold value used for the binary split of the node i .
  • n_node_samples — number of training samples that reached the node i .
  • values — shares of samples from each class.

Let’s save all these arrays. 

children_left = model.tree_.children_left
# [ 1,  2,  3,  4,  5,  6, -1, -1, -1, -1, -1, -1, -1]
children_right = model.tree_.children_right
# [12, 11, 10,  9,  8,  7, -1, -1, -1, -1, -1, -1, -1]
features = model.tree_.feature
# [30, 34,  0,  3,  6,  6, -2, -2, -2, -2, -2, -2, -2]
thresholds = model.tree_.threshold
# [ 0.5,  0.5, 59.5,  0.5,  6.5,  2.5, -2. , -2. , -2. , -2. , -2. , -2. , -2. ]
num_nodes = model.tree_.n_node_samples
# [45211, 43700, 30692, 29328, 14165,  4165,  2053,  2112, 10000, 
#  15163,  1364, 13008,  1511] 
values = model.tree_.value
# [[[0.8830152 , 0.1169848 ]],
# [[0.90135011, 0.09864989]],
# [[0.87671054, 0.12328946]],
# [[0.88550191, 0.11449809]],
# [[0.8530886 , 0.1469114 ]],
# [[0.76686675, 0.23313325]],
# [[0.87043351, 0.12956649]],
# [[0.66619318, 0.33380682]],
# [[0.889     , 0.111     ]],
# [[0.91578184, 0.08421816]],
# [[0.68768328, 0.31231672]],
# [[0.95948647, 0.04051353]],
# [[0.35274653, 0.64725347]]]

It will be more convenient for us to work with a hierarchical view of the tree structure, so let’s iterate through all nodes and, for each node, save the parent node ID and whether it was a right or left branch. 

hierarchy = {}

for node_id in range(n_nodes):
  if children_left[node_id] != -1: 
    hierarchy[children_left[node_id]] = {
      'parent': node_id, 
      'condition': 'left'
    }
  
  if children_right[node_id] != -1:
      hierarchy[children_right[node_id]] = {
       'parent': node_id, 
       'condition': 'right'
  }

print(hierarchy)
# {1: {'parent': 0, 'condition': 'left'},
# 12: {'parent': 0, 'condition': 'right'},
# 2: {'parent': 1, 'condition': 'left'},
# 11: {'parent': 1, 'condition': 'right'},
# 3: {'parent': 2, 'condition': 'left'},
# 10: {'parent': 2, 'condition': 'right'},
# 4: {'parent': 3, 'condition': 'left'},
# 9: {'parent': 3, 'condition': 'right'},
# 5: {'parent': 4, 'condition': 'left'},
# 8: {'parent': 4, 'condition': 'right'},
# 6: {'parent': 5, 'condition': 'left'},
# 7: {'parent': 5, 'condition': 'right'}}

The next step is to filter out the leaf nodes since they are terminal and the most interesting for us as they define the customer segments. 

leaves = []
for node_id in range(n_nodes):
    if (children_left[node_id] == -1) and (children_right[node_id] == -1):
        leaves.append(node_id)
print(leaves)
# [6, 7, 8, 9, 10, 11, 12]
leaves_df = pd.DataFrame({'node_id': leaves})

The next step is to determine all the conditions applied to each group since they will define our customer segments. The first function get_condition will give us the tuple of feature, condition type and threshold for a node. 

def get_condition(node_id, condition, features, thresholds, feature_names):
    # print(node_id, condition)
    feature = feature_names[features[node_id]]
    threshold = thresholds[node_id]
    cond = '>' if condition == 'right'  else '<='
    return (feature, cond, threshold)

print(get_condition(0, 'left', features, thresholds, feature_names)) 
# ('poutcome_success', '<=', 0.5)

print(get_condition(0, 'right', features, thresholds, feature_names))
# ('poutcome_success', '>', 0.5)

The next function will allow us to recursively go from the leaf node to the root and get all the binary splits. 

def get_decision_path_rec(node_id, decision_path, hierarchy):
  if node_id == 0:
    yield decision_path 
  else:
    parent_id = hierarchy[node_id]['parent']
    condition = hierarchy[node_id]['condition']
    for res in get_decision_path_rec(parent_id, decision_path + [(parent_id, condition)], hierarchy):
        yield res

decision_path = list(get_decision_path_rec(12, [], hierarchy))[0]
print(decision_path) 
# [(0, 'right')]

fmt_decision_path = list(map(
  lambda x: get_condition(x[0], x[1], features, thresholds, feature_names), 
  decision_path))
print(fmt_decision_path)
# [('poutcome_success', '>', 0.5)]

Let’s save the logic of executing the recursion and formatting into a wrapper function.

def get_decision_path(node_id, features, thresholds, hierarchy, feature_names):
  decision_path = list(get_decision_path_rec(node_id, [], hierarchy))[0]
  return list(map(lambda x: get_condition(x[0], x[1], features, thresholds, 
    feature_names), decision_path))

We’ve learned how to get each node’s binary split conditions. The only remaining logic is to combine the conditions. 

def get_decision_path_string(node_id, features, thresholds, hierarchy, 
  feature_names):
  conditions_df = pd.DataFrame(get_decision_path(node_id, features, thresholds, hierarchy, feature_names))
  conditions_df.columns = ['feature', 'condition', 'threshold']

  left_conditions_df = conditions_df[conditions_df.condition == '<=']
  right_conditions_df = conditions_df[conditions_df.condition == '>']

  # deduplication 
  left_conditions_df = left_conditions_df.groupby(['feature', 'condition'], as_index = False).min()
  right_conditions_df = right_conditions_df.groupby(['feature', 'condition'], as_index = False).max()
  
  # concatenation
  fin_conditions_df = pd.concat([left_conditions_df, right_conditions_df])\
      .sort_values(['feature', 'condition'], ascending = False)
  
  # formatting 
  fin_conditions_df['cond_string'] = list(map(
      lambda x, y, z: '(%s %s %.2f)' % (x, y, z),
      fin_conditions_df.feature,
      fin_conditions_df.condition,
      fin_conditions_df.threshold
  ))
  return ' and '.join(fin_conditions_df.cond_string.values)

print(get_decision_path_string(12, features, thresholds, hierarchy, 
  feature_names))
# (poutcome_success > 0.50)

Now, we can calculate the conditions for each group. 

leaves_df['condition'] = leaves_df['node_id'].map(
  lambda x: get_decision_path_string(x, features, thresholds, hierarchy, 
  feature_names)
)

The last step is to add their size and conversion to the groups.

leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total)\
  .map(lambda x: int(round(x/100)))
leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()

Now, we can use these rules to make decisions. We can sort groups by conversion (probability of successful contact) and pick the customers with the highest probability. 

leaves_df.sort_values('conversion', ascending = False)\
  .drop('node_id', axis = 1).set_index('condition')
Image by author

Imagine we have the resources to contact only around 10% of our user base. In that case, we can focus on the first three groups. Even with such limited capacity, we would expect to get almost 40% conversion, which is a really good result, and we’ve achieved it with just a bunch of straightforward heuristics.

In real life, it’s also worth testing the model (or heuristics) before deploying it in production. I would split the training dataset into training and validation parts (by time to avoid leakage) and see the heuristics performance on the validation set to have a better view of the actual model quality.
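As an illustration only (this dataset has no explicit year, so the `contact_date` column below is hypothetical), such a time-based split could look like this:

# Illustration only: assumes a hypothetical `contact_date` datetime column
df_sorted = df.sort_values('contact_date')
cutoff = df_sorted['contact_date'].quantile(0.9)  # keep the last ~10% of the timeline for validation

train_part = df_sorted[df_sorted.contact_date <= cutoff]
valid_part = df_sorted[df_sorted.contact_date > cutoff]

# Fit the tree (or apply the extracted heuristics) on train_part,
# then check the conversion of the selected segments on valid_part.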

Working with high cardinality categories

Another topic that is worth discussing in this context is category encoding, since we have to encode the categorical variables for sklearn implementation. We’ve used a straightforward approach with one-hot encoding, but in some cases, it doesn’t work.

Imagine we also have a region in the data. I’ve synthetically generated English cities for each row. We have 155 unique regions, so the number of features has increased to 190. 

model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 100, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

So, the basic tree now has lots of conditions based on regions and it’s not convenient to work with them.

Image by author

In such a case, it might not be meaningful to explode the number of features, and it’s time to think about encoding. There’s a comprehensive article, “Categorically: Don’t explode — encode!”, that shares a bunch of different options to handle high cardinality categorical variables. I think the most feasible ones in our case will be the following two options:

  • Count or Frequency Encoder that shows good performance in benchmarks. This encoding assumes that categories of similar size would have similar characteristics. 
  • Target Encoder, where we can encode the category by the mean value of the target variable. It will allow us to prioritise segments with higher conversion and deprioritise segments with lower conversion. Ideally, it would be nice to use historical data to get the averages for the encoding, but we will use the existing dataset. 

However, it will be interesting to test different approaches, so let’s split our dataset into train and test, saving 10% for validation. For simplicity, I’ve used one-hot encoding for all columns except for region (since it has the highest cardinality).

from sklearn.model_selection import train_test_split
fin_df = pd.get_dummies(df, columns=['job', 'marital', 'education', 
  'poutcome', 'contact'], dtype = int, drop_first = False)
train_df, test_df = train_test_split(fin_df,test_size=0.1, random_state=42)
print(train_df.shape[0], test_df.shape[0])
# (40689, 4522)

For convenience, let’s combine all the logic for parsing the tree into one function.

def get_model_definition(model, feature_names):
  n_nodes = model.tree_.node_count
  children_left = model.tree_.children_left
  children_right = model.tree_.children_right
  features = model.tree_.feature
  thresholds = model.tree_.threshold
  num_nodes = model.tree_.n_node_samples
  values = model.tree_.value

  hierarchy = {}

  for node_id in range(n_nodes):
      if children_left[node_id] != -1: 
          hierarchy[children_left[node_id]] = {
            'parent': node_id, 
            'condition': 'left'
          }
    
      if children_right[node_id] != -1:
            hierarchy[children_right[node_id]] = {
             'parent': node_id, 
             'condition': 'right'
            }

  leaves = []
  for node_id in range(n_nodes):
      if (children_left[node_id] == -1) and (children_right[node_id] == -1):
          leaves.append(node_id)
  leaves_df = pd.DataFrame({'node_id': leaves})
  leaves_df['condition'] = leaves_df['node_id'].map(
    lambda x: get_decision_path_string(x, features, thresholds, hierarchy, feature_names)
  )

  leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
  leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
  leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total).map(lambda x: int(round(x/100)))
  leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
  leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()
  leaves_df = leaves_df.sort_values('conversion', ascending = False)\
    .drop('node_id', axis = 1).set_index('condition')
  leaves_df['cum_share_of_total'] = leaves_df['share_of_total'].cumsum()
  leaves_df['cum_share_of_converted'] = leaves_df['share_of_converted'].cumsum()
  return leaves_df

Let’s create an encodings data frame, calculating frequencies and conversions. 

region_encoding_df = train_df.groupby('region', as_index = False)\
  .aggregate({'id': 'count', 'y': 'mean'}).rename(columns = 
    {'id': 'region_count', 'y': 'region_target'})

Then, merge it into our training and validation sets. For the validation set, we will also fill NAs with the averages.

train_df = train_df.merge(region_encoding_df, on = 'region')

test_df = test_df.merge(region_encoding_df, on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
  .fillna(region_encoding_df.region_target.mean())
test_df['region_count'] = test_df['region_count']\
  .fillna(region_encoding_df.region_count.mean())

Now, we can fit the models and get their structures.

count_feature_names = train_df.drop(
  ['y', 'id', 'region_target', 'region'], axis = 1).columns
target_feature_names = train_df.drop(
  ['y', 'id', 'region_count', 'region'], axis = 1).columns
print(len(count_feature_names), len(target_feature_names))
# (36, 36)

count_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
count_model.fit(train_df[count_feature_names], train_df['y'])

target_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
target_model.fit(train_df[target_feature_names], train_df['y'])

count_model_def_df = get_model_definition(count_model, count_feature_names)
target_model_def_df = get_model_definition(target_model, target_feature_names)

Let’s look at the structures and select the top categories up to 10–15% of our target audience. We can also apply these conditions to our validation sets to test our approach in practice. 

Let’s start with Count Encoder. 

Image by author
count_selected_df = test_df[
    (test_df.poutcome_success > 0.50) | 
    ((test_df.poutcome_success <= 0.50) & (test_df.age > 60.50)) | 
    ((test_df.region_count > 3645.50) & (test_df.region_count <= 8151.50) & 
         (test_df.poutcome_success <= 0.50) & (test_df.contact_cellular > 0.50) & (test_df.age <= 60.50))
]

print(count_selected_df.shape[0], count_selected_df.y.sum())
# (508, 227)

We can also see what regions have been selected, and it’s only Manchester.

Image by author

Let’s continue with the Target encoding. 

Image by author
target_selected_df = test_df[
    ((test_df.region_target > 0.21) & (test_df.poutcome_success > 0.50)) | 
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month <= 6.50) & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month > 8.50) & (test_df.housing <= 0.50) 
         & (test_df.contact_unknown <= 0.50)) |
    ((test_df.region_target <= 0.21) & (test_df.poutcome_success > 0.50)) |
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month > 6.50) & (test_df.month <= 8.50) 
         & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50))
]

print(target_selected_df.shape[0], target_selected_df.y.sum())
# (502, 248)

We see a slightly lower number of selected users for communication but a significantly higher number of conversions: 248 vs. 227 (+9.3%).

Let’s also look at the selected categories. We see that the model picked up all the cities with high conversions (Manchester, Liverpool, Bristol, Leicester, and New Castle), but there are also many small regions with high conversions solely due to chance.

region_encoding_df[region_encoding_df.region_target > 0.21]\
  .sort_values('region_count', ascending = False)
Image by author

In our case, it doesn’t impact much since the share of such small cities is low. However, if you have way more small categories, you might see significant drawbacks of overfitting. Target Encoding might be tricky at this point, so it’s worth keeping an eye on the output of your model. 

Luckily, there’s an approach that can help you overcome this issue. Following the article “Encoding Categorical Variables: A Deep Dive into Target Encoding”, we can add smoothing. The idea is to combine the group’s conversion rate with the overall average: the larger the group, the more weight its data carries, while smaller segments will lean more towards the global average.
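In formula form, for a region with n customers, the smoothed encoding looks like this (mirroring the code below, where k controls where the transition happens and f how sharp it is):

\[ \textbf{weight} = \frac{1}{1 + e^{-(\textsf{n} - \textsf{k})/\textsf{f}}} \]

\[ \textbf{encoding} = \textbf{weight} * \textbf{conversion}_{\textsf{region}} + (1 - \textbf{weight}) * \textbf{conversion}_{\textsf{global}} \]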

First, I’ve selected the parameters that make sense for our distribution, looking at a bunch of options. I chose to use the global average for the groups under 100 people. This part is a bit subjective, so use common sense and your knowledge about the business domain.

import numpy as np
import matplotlib.pyplot as plt

global_mean = train_df.y.mean()

k = 100
f = 10
smooth_df = pd.DataFrame({'region_count':np.arange(1, 100001, 1) })
smooth_df['smoothing'] = (1 / (1 + np.exp(-(smooth_df.region_count - k) / f)))

ax = plt.scatter(smooth_df.region_count, smooth_df.smoothing)
plt.xscale('log')
plt.ylim([-.1, 1.1])
plt.title('Smoothing')
Image by author

Then, we can calculate, based on the selected parameters, the smoothing coefficients and blended averages.

region_encoding_df['smoothing'] = (1 / (1 + np.exp(-(region_encoding_df.region_count - k) / f)))
region_encoding_df['region_target'] = region_encoding_df.smoothing * region_encoding_df.raw_region_target \
    + (1 - region_encoding_df.smoothing) * global_mean

Then, we can fit another model with smoothed target category encoding.

train_df = train_df.merge(region_encoding_df[['region', 'region_target']], 
  on = 'region')
test_df = test_df.merge(region_encoding_df[['region', 'region_target']], 
  on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
  .fillna(region_encoding_df.region_target.mean())

target_v2_feature_names = train_df.drop(['y', 'id', 'region'], axis = 1)\
  .columns

target_v2_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
target_v2_model.fit(train_df[target_v2_feature_names], train_df['y'])
target_v2_model_def_df = get_model_definition(target_v2_model, 
  target_v2_feature_names)
Image by author
target_v2_selected_df = test_df[
    ((test_df.region_target > 0.12) & (test_df.poutcome_success > 0.50)) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month <= 6.50) & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month > 8.50) & (test_df.housing <= 0.50) 
         & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target <= 0.12) & (test_df.poutcome_success > 0.50) ) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month > 6.50) & (test_df.month <= 8.50) 
         & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50) )
]

target_v2_selected_df.shape[0], target_v2_selected_df.y.sum()
# (500, 247)

We can see that we’ve eliminated the small cities and prevented overfitting in our model while keeping roughly the same performance, capturing 247 conversions.

region_encoding_df[region_encoding_df.region_target > 0.12]
Image by author

You can also use TargetEncoder from sklearn, which smoothes and mixes the category and global means depending on the segment size. However, it also adds random noise, which is not ideal for our case of heuristics.
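For reference, here is a minimal sketch of that alternative (assuming scikit-learn >= 1.3, where sklearn.preprocessing.TargetEncoder was introduced). It is not part of the original pipeline above, and the cross-fitting it performs inside fit_transform is the source of the “noise” mentioned.

from sklearn.preprocessing import TargetEncoder

# Hypothetical drop-in for our manual encoding: smoothing is handled internally.
encoder = TargetEncoder(smooth='auto', random_state=42)
train_region_sklearn = encoder.fit_transform(train_df[['region']], train_df['y'])
test_region_sklearn = encoder.transform(test_df[['region']])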

You can find the full code on GitHub.

Summary

In this article, we explored how to extract simple “rules” from data and use them to inform business decisions. We generated heuristics using a Decision Tree Classifier and touched on the important topic of categorical encoding since decision tree algorithms require categorical variables to be converted.

We saw that this rule-based approach can be surprisingly effective, helping you reach business decisions quickly. However, it’s worth noting that this simplistic approach has its drawbacks:

  • We are trading off the model’s power and accuracy for its simplicity and interpretability, so if you’re optimising for accuracy, choose another approach.
  • Even though we’re using a set of static heuristics, your data still can change, and they might become outdated, so you need to recheck your model from time to time. 

Thank you very much for reading this article. I hope it was insightful to you. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

Dataset: Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306

The post Mining Rules from Data appeared first on Towards Data Science.

How to Optimize your Python Program for Slowness https://towardsdatascience.com/how-to-optimize-your-python-program-for-slowness/ Tue, 08 Apr 2025 00:25:35 +0000 https://towardsdatascience.com/?p=605677 Write a short program that finishes after the universe dies

The post How to Optimize your Python Program for Slowness appeared first on Towards Data Science.


Also available: A Rust version of this article.

Everyone talks about making Python programs faster [1, 2, 3], but what if we pursue the opposite goal? Let’s explore how to make them slower — absurdly slower. Along the way, we’ll examine the nature of computation, the role of memory, and the scale of unimaginably large numbers.

Our guiding challenge: write short Python programs that run for an extraordinarily long time.

To do this, we’ll explore a sequence of rule sets — each one defining what kind of programs we’re allowed to write, by placing constraints on halting, memory, and program state. This sequence isn’t a progression, but a series of shifts in perspective. Each rule set helps reveal something different about how simple code can stretch time.

Here are the rule sets we’ll investigate:

  1. Anything Goes — Infinite Loop
  2. Must Halt, Finite Memory — Nested, Fixed-Range Loops
  3. Infinite, Zero-Initialized Memory — 5-State Turing Machine
  4. Infinite, Zero-Initialized Memory — 6-State Turing Machine (>10↑↑15 steps)
  5. Infinite, Zero-Initialized Memory — Plain Python (compute 10↑↑15 without Turing machine emulation)

Aside: 10↑↑15 is not a typo or a double exponent. It’s a number so large that “exponential” and “astronomical” don’t describe it. We’ll define it in Rule Set 4.

We start with the most permissive rule set. From there, we’ll change the rules step by step to see how different constraints shape what long-running programs look like — and what they can teach us.

Rule Set 1: Anything Goes — Infinite Loop

We begin with the most permissive rules: the program doesn’t need to halt, can use unlimited memory, and can contain arbitrary code.

If our only goal is to run forever, the solution is immediate:

while True:
  pass

This program is short, uses negligible memory, and never finishes. It satisfies the challenge in the most literal way — by doing nothing forever.

Of course, it’s not interesting — it does nothing. But it gives us a baseline: if we remove all constraints, infinite runtime is trivial. In the next rule set, we’ll introduce our first constraint: the program must eventually halt. Let’s see how far we can stretch the running time under that new requirement — using only finite memory.

Rule Set 2: Must Halt, Finite Memory — Nested, Fixed-Range Loops

If we want a program that runs longer than the universe will survive and then halts, it’s easy. Just write two nested loops, each counting over a fixed range from 0 to 10¹⁰⁰−1:

for a in range(10**100):
  for b in range(10**100):
      if b % 10_000_000 == 0:
          print(f"{a:,}, {b:,}")

You can see that this program halts after 10¹⁰⁰ × 10¹⁰⁰ steps. That’s 10²⁰⁰. And — ignoring the print—this program uses only a small amount of memory to hold its two integer loop variables—just 144 bytes.

My desktop computer runs this program at about 14 million steps per second. But suppose it could run at Planck speed (the smallest meaningful unit of time in physics). That would be about 10⁵⁰ steps per year — so 10¹⁵⁰ years to complete.

Current cosmological models estimate the heat death of the universe in 10¹⁰⁰ years, so our program will run about 100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 times longer than the projected lifetime of the universe.

Aside: Practical concerns about running a program beyond the end of the universe are outside the scope of this article.

For an added margin, we can use more memory. Instead of 144 bytes for variables, let’s use 64 gigabytes — about what you’d find in a well-equipped personal computer. That’s about 500 million times more memory, which gives us about one billion variables instead of 2. If each variable iterates over the full 10¹⁰⁰ range, the total number of steps becomes roughly 10¹⁰⁰^(10⁹), or about 10^(100 billion) steps. At Planck speed — roughly 10⁵⁰ steps per year — that corresponds to 10^(100 billion − 50) years of computation.
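To make “one billion loop variables” concrete, here is a sketch of mine (not from the original article) that generalizes the two nested loops to any number of variables without writing them all out: treat the variables as digits of a giant odometer. With num_vars=2 and limit=10**100 it matches the earlier program; a billion variables would need gigabytes of memory, roughly in the spirit of the 64 GB described above.

def odometer(num_vars, limit):
    counters = [0] * num_vars          # the loop variables, all zero-initialized
    while True:
        yield                          # one step of the innermost loop body
        i = num_vars - 1
        while i >= 0 and counters[i] == limit - 1:
            counters[i] = 0            # this variable rolled over; carry to the left
            i -= 1
        if i < 0:
            return                     # every variable hit its limit: halt
        counters[i] += 1

# Tiny demo: 3 variables, each counting 0..9, gives 10**3 = 1,000 steps.
print(sum(1 for _ in odometer(3, 10)))  # 1000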


Can we do better? Well, if we allow an unrealistic but interesting rule change, we can do much, much better.

Rule Set 3: Infinite, Zero-Initialized Memory — 5-State Turing Machine

What if we allow infinite memory — so long as it starts out entirely zero?

Aside: Why don’t we allow infinite, arbitrarily initialized memory? Because it trivializes the challenge. For example, you could mark a single byte far out in memory with a 0x01—say, at position 10¹²⁰—and write a tiny program that just scans until it finds it. That program would take an absurdly long time to run — but it wouldn’t be interesting. The slowness is baked into the data, not the code. We’re after something deeper: small programs that generate their own long runtimes from simple, uniform starting conditions.

My first idea was to use the memory to count upward in binary:

0
1
10
11
100
101
110
111
...

We can do that — but how do we know when to stop? If we don’t stop, we’re violating the “must halt” rule. So, what else can we try?

Let’s take inspiration from the father of Computer Science, Alan Turing. We’ll program a simple abstract machine — now known as a Turing machine — under the following constraints:

  • The machine has infinite memory, laid out as a tape that extends endlessly in both directions. Each cell on the tape holds a single bit: 0 or 1.
  • A read/write head moves across the tape. On each step, it reads the current bit, writes a new bit (0 or 1), and moves one cell left or right.
A read/write head positioned on an infinite tape.
  • The machine also has an internal variable called state, which can hold one of n values. For example, with 5 states, we might name the possible values A, B, C, D, and E—plus a special halting state H, which we don’t count among the five. The machine always starts in the first state, A.

We can express a full Turing machine program as a transition table. Here’s an example we’ll walk through step by step.

A 5-state Turing machine transition table.
  • Each row corresponds to the current tape value (0 or 1).
  • Each column corresponds to the current state (A through E).
  • Each entry in the table tells the machine what to do next:
    • The first character is the bit to write (0 or 1)
    • The second is the direction to move (L for left, R for right)
    • The third is the next state to enter (A, B, C, D, E, or H, where H is the special halting state).

Now that we’ve defined the machine, let’s see how it behaves over time.

We’ll refer to each moment in time — the full configuration of the machine and tape — as a step. This includes the current tape contents, the head position, and the machine’s internal state (like A, B, or H).

Below is Step 0. The head is pointing to a 0 on the tape, and the machine is in state A.

Looking at row 0, column A in the program table, we find the instruction 1RB. That means:

  • Write 1 to the current tape cell.
  • Move the head Right.
  • Enter state B.

Step 0:

This puts us in Step 1:

The machine is now in state B, pointing at the next tape cell (again 0).

What will happen if we let this Turing machine keep running? It will run for exactly 47,176,870 steps — and then halt. 

Aside: With a Google sign in, you can run this yourself via a Python notebook on Google Colab. Alternatively, you can copy and run the notebook locally on your own computer by downloading it from GitHub.

That number 47,176,870 is astonishing on its own, but seeing the full run makes it more tangible. We can visualize the execution using a space-time diagram, where each row shows the tape at a single step, from top (earliest) to bottom (latest). In the image:

  • The first row is blank — it shows the all-zero tape before the machine takes its first step.
  • 1s are shown in orange.
  • 0s are shown in white.
  • Light orange appears where 0s and 1s are so close together they blend.
Space-time diagram for the champion 5-state Turing machine. It runs for 47,176,870 steps before halting. Each row shows the tape at a single step, starting from the top. Orange represents 1, white represents 0.

In 2023, an online group of amateur researchers organized through bbchallenge.org proved that this is the longest-running 5-state Turing machine that eventually halts.


Want to see this Turing machine in motion? You can watch the full 47-million-step execution unfold in this pixel-perfect video:

Or interact with it directly using the Busy Beaver Blaze web app.

The video generator and web app are part of busy-beaver-blaze, the open-source Python & Rust project that accompanies this article.


It’s hard to believe that such a small machine can run 47 million steps and still halt. But it gets even more astonishing: the team at bbchallenge.org found a 6-state machine with a runtime so long it can’t even be written with ordinary exponents.

Rule Set 4: Infinite, Zero-Initialized Memory — 6-State Turing Machine (>10↑↑15 steps)

As of this writing, the longest running (but still halting) 6-state Turing machine known to humankind is:

A   B   C   D   E   F
0   1RB 1RC 1LC 0LE 1LF 0RC
1   0LD 0RF 1LA 1RH 0RB 0RE
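To make the table concrete, here is a minimal interpreter sketch of my own (not the author’s busy-beaver-blaze code) with this 6-state program transcribed into a Python dict. A defaultdict plays the role of the infinite, zero-initialized tape.

from collections import defaultdict

# Transition table above, keyed by (state, current bit).
# Each entry: (bit to write, head move, next state); 'H' is the halting state.
TABLE = {
    ('A', 0): (1, +1, 'B'), ('A', 1): (0, -1, 'D'),
    ('B', 0): (1, +1, 'C'), ('B', 1): (0, +1, 'F'),
    ('C', 0): (1, -1, 'C'), ('C', 1): (1, -1, 'A'),
    ('D', 0): (0, -1, 'E'), ('D', 1): (1, +1, 'H'),
    ('E', 0): (1, -1, 'F'), ('E', 1): (0, +1, 'B'),
    ('F', 0): (0, +1, 'C'), ('F', 1): (0, +1, 'E'),
}

def run(table, max_steps):
    tape = defaultdict(int)            # infinite, zero-initialized tape
    head, state = 0, 'A'
    for step in range(max_steps):
        if state == 'H':
            return step                # halted after `step` steps
        write, move, state = table[(state, tape[head])]
        tape[head] = write
        head += move
    return None                        # still running after max_steps

print(run(TABLE, 1_000_000))           # prints None: nowhere near halting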

Here is a video showing its first 10 trillion steps:

And here you can run it interactively via a web app.

So, if we are patient — comically patient — how long will this Turing machine run? More than 10↑↑15, where “10 ↑↑ 15” means a power tower of fifteen 10s: 10^(10^(10^(…^10))), stacked 15 levels high. (A quick code sketch of this definition follows the list below.)

This is not the same as 10¹⁵ (which is just a regular exponent). Instead:

  • 10¹ = 10
  • 10¹⁰ = 10,000,000,000
  • 10^10^10 is 10 to the power 10,000,000,000 (ten billion), already unimaginably large.
  • 10↑↑4 is so large that it vastly exceeds the number of atoms in the observable universe.
  • 10↑↑15 is so large that writing it in exponent notation becomes annoying.
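To make the notation concrete, here is the quick sketch promised above (mine, not the article’s), which evaluates small tetrations with Python’s built-in ** operator, the shortcut that Rule Set 5 will deliberately avoid.

def tetration(base, height):
    # base ↑↑ height: a power tower of `height` copies of base.
    result = 1
    for _ in range(height):
        result = base ** result
    return result

print(tetration(10, 1))   # 10
print(tetration(10, 2))   # 10000000000
# tetration(10, 3) already has about ten billion digits; 10↑↑15 is hopeless to print.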

Pavel Kropitz announced this 6-state machine on May 30, 2022. Shawn Ligocki has a great write up explaining both his and Pavel’s discoveries. To prove that these machines run so long and then halt, researchers used a mix of analysis and automated tools. Rather than simulating every step, they identified repeating structures and patterns that could be proven — using formal, machine-verified proofs — to eventually lead to halting.


Up to this point, we’ve been talking about Turing machines — specifically, the longest-known 5- and 6-state machines that eventually halt. We ran the 5-state champion to completion and watched visualizations to explore its behavior. But the discovery that it’s the longest halting machine with 5 states — and the identification of the 6-state contender — came from extensive research and formal proofs, not from running them step-by-step.

That said, the Turing machine interpreter I built in Python can run for millions of steps, and the visualizer written in Rust can handle trillions (see GitHub). But even 10 trillion steps isn’t an atom in a drop of water in the ocean compared to the full runtime of the 6-state machine. And running it that far doesn’t get us any closer to understanding why it runs so long.

Aside: Python and Rust “interpreted” the Turing machines up to some point — reading their transition tables and applying the rules step by step. You could also say they “emulated” them, in that they reproduced their behavior exactly. I avoid the word “simulated”: a simulated elephant isn’t an elephant, but a simulated computer is a computer.

Returning to our central challenge: we want to understand what makes a short program run for a long time. Instead of analyzing these Turing machines, let’s construct a Python program whose 10↑↑15 runtime is clear by design.

Rule Set 5: Infinite, Zero-Initialized Memory — Plain Python (compute 10↑↑15 without Turing machine emulation)

Our challenge is to write a small Python program that runs for at least 10↑↑15 steps, using any amount of zero-initialized memory.

To achieve this, we’ll compute the value of 10↑↑15 in a way that guarantees the program takes at least that many steps. The ↑↑ operator is called tetration—recall from Rule Set 4 that ↑↑ stacks exponents: for example, 10↑↑3 means 10^(10^10). It’s an extremely fast-growing function. We will program it from the ground up.

Rather than rely on built-in operators, we’ll define tetration from first principles:

  • Tetration, implemented by the function tetrate, as repeated exponentiation
  • Exponentiation, via exponentiate, as repeated multiplication
  • Multiplication, via multiply, as repeated addition
  • Addition, via add, as repeated increment

Each layer builds on the one below it, using only zero-initialized memory and in-place updates.

We’ll begin at the foundation — with the simplest operation of all: increment.

Increment

Here’s our definition of increment and an example of its use:

from gmpy2 import xmpz

def increment(acc_increment):
  assert is_valid_accumulator(acc_increment), "not a valid accumulator"
  acc_increment += 1

def is_valid_accumulator(acc):
  return isinstance(acc, xmpz) and acc >= 0  

b = xmpz(4)
print(f"++{b} = ", end="")
increment(b)
print(b)
assert b == 5

Output:

++4 = 5

We’re using xmpz, a mutable arbitrary-precision integer type provided by the gmpy2 library. It behaves like Python’s built-in int in terms of numeric range—limited only by memory—but unlike int, it supports in-place updates.

To stay true to the spirit of a Turing machine and to keep the logic minimal and observable, we restrict ourselves to just a few operations:

  • Creating an integer with value 0 (xmpz(0))
  • In-place increment (+= 1) and decrement (-= 1)
  • Comparing with zero

All arithmetic is done in-place, with no copies and no temporary values. Each function in our computation chain modifies an accumulator directly. Most functions also take an input value a, but increment—being the most basic—does not. We use descriptive names like increment_acc, add_acc, and so on to make the operation clear and to support later functions where multiple accumulators will appear together.

Aside: Why not use Python’s built-in int type? It supports arbitrary precision and can grow as large as your memory allows. But it’s also immutable, meaning any update like += 1 creates a new integer object. Even if you think you’re modifying a large number in place, Python is actually copying all of its internal memory—no matter how big it is.
For example:

x = 10**100
y = x
x += 1
assert x == 10**100 + 1 and y == 10**100

Even though x and y start out identical, x += 1 creates a new object—leaving y unchanged. This behavior is fine for small numbers, but it violates our rules about memory use and in-place updates. That’s why we use gmpy2.xmpz, a mutable arbitrary-precision integer that truly supports efficient, in-place changes.

Addition

With increment defined, we next define addition as repeated incrementing.

def add(a, add_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(add_acc), "not a valid accumulator"
  for _ in range(a):
      add_acc += 1

def is_valid_other(a):
  return isinstance(a, int) and a >= 0      

a = 2
b = xmpz(4)
print(f"Before: id(b) = {id(b)}")
print(f"{a} + {b} = ", end="")
add(a, b)
print(b)
print(f"After:  id(b) = {id(b)}")  # ← compare object IDs
assert b == 6

Output:

Before: id(b) = 2082778466064
2 + 4 = 6
After:  id(b) = 2082778466064

The function adds a to add_acc by incrementing add_acc one step at a time, a times. The before and after ids are the same, showing that no new object was created—add_acc was truly updated in place.

Aside: You might wonder why add doesn’t just call our increment function. We could write it that way—but we’re deliberately inlining each level by hand. This keeps all loops visible, makes control flow explicit, and helps us reason precisely about how much work each function performs.

Even though gmpy2.xmpz supports direct addition, we don’t use it. We’re working at the most primitive level possible—incrementing by 1—to keep the logic simple, intentionally slow, and to make the amount of work explicit.

As with increment_acc, we update add_acc in place, with no copying or temporary values. The only operation we use is += 1, repeated a times.

Next, we define multiplication.

Multiplication

With addition in place, we can now define multiplication as repeated addition. Here’s the function and example usage. Unlike add and increment, this one builds up a new xmpz value from zero and returns it.

def multiply(a, multiply_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(multiply_acc), "not a valid accumulator"

  add_acc = xmpz(0)
  for _ in count_down(multiply_acc):
      for _ in range(a):
          add_acc += 1
  return add_acc

def count_down(acc):
  assert is_valid_accumulator(acc), "not a valid accumulator"
  while acc > 0:
      acc -= 1
      yield

a = 2
b = xmpz(4)
print(f"{a} * {b} = ", end="")
c = multiply(a, b)
print(c)
assert c == 8
assert b == 0

Output:

2 * 4 = 8

This multiplies a by the value of multiply_acc by adding a to add_acc once for every time multiply_acc can be decremented. The result is returned and then assigned to c. The original multiply_acc is decremented to zero and consumed in the process.

You might wonder what this line does:

for _ in count_down(multiply_acc):

While xmpz technically works with range(), doing so converts it to a standard Python int, which is immutable. That triggers a full copy of its internal memory—an expensive operation for large values. Worse, each decrement step would involve allocating a new integer and copying all previous bits, so what should be a linear loop ends up doing quadratic total work. Our custom count_down() avoids all that by decrementing in place, yielding control without copying, and maintaining predictable memory use.

We’ve built multiplication from repeated addition. Now it’s time to go a layer further: exponentiation.

Exponentiation

We define exponentiation as repeated multiplication. As before, we perform all work using only incrementing, decrementing, and in-place memory. As with multiply, the final result is returned while the input accumulator is consumed.

Here’s the function and example usage:

def exponentiate(a, exponentiate_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(exponentiate_acc), "not a valid accumulator"
  assert a > 0 or exponentiate_acc != 0, "0^0 is undefined"

  multiply_acc = xmpz(0)
  multiply_acc += 1
  for _ in count_down(exponentiate_acc):
      add_acc = xmpz(0)
      for _ in count_down(multiply_acc):
          for _ in range(a):
              add_acc += 1
      multiply_acc = add_acc
  return multiply_acc


a = 2
b = xmpz(4)
print(f"{a}^{b} = ", end="")
c = exponentiate(a, b)
print(c)
assert c == 16
assert b == 0

Output:

2^4 = 16

This raises a to the power of exponentiate_acc, using only incrementing, decrementing, and loop control. We initialize multiply_acc to 1 with a single increment—because repeatedly multiplying from zero would get us nowhere. Then, for each time exponentiate_acc can be decremented, we multiply the current result (multiply_acc) by a. As with the earlier layers, we inline the multiply logic directly instead of calling the multiply function—so the control flow and step count stay fully visible.

Aside: And how many times is += 1 called? Obviously at least 2⁴ times—because our result is 2⁴, and we reach it by incrementing from zero. More precisely, the number of increments is:

• 1 increment — initializing multiply_acc to one

Then we loop four times, and in each loop, we multiply the current value of multiply_acc by a = 2, using repeated addition:

• 2 increments — for multiply_acc = 1, add 2 once
• 4 increments — for multiply_acc = 2, add 2 twice
• 8 increments — for multiply_acc = 4, add 2 four times
• 16 increments — for multiply_acc = 8, add 2 eight times

That’s a total of 1 + 2 + 4 + 8 + 16 = 31 increments, which is 2⁵-1. In general, the number of calls to increment will be exponential, but the number is not the same exponential that we are computing.
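If you want to sanity-check that count, here is a small sketch of mine that mirrors the same loop structure with plain integers (ordinary ints are fine here, since we only care about counting the steps):

def count_increments_for_power(a, b):
    increments = 0
    multiply_acc = 1              # one increment to initialize the accumulator
    increments += 1
    for _ in range(b):            # b rounds of "multiply the running result by a"
        add_acc = 0
        for _ in range(multiply_acc):
            for _ in range(a):
                add_acc += 1
                increments += 1
        multiply_acc = add_acc
    return multiply_acc, increments

print(count_increments_for_power(2, 4))  # (16, 31): 2**4, reached with 2**5 - 1 increments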

With exponentiation defined, we’re ready for the top of our tower: tetration.

Tetration

Here’s the function and example usage:

def tetrate(a, tetrate_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(tetrate_acc), "not a valid accumulator"
  assert a > 0, "we don't define 0↑↑b"

  exponentiate_acc = xmpz(0)
  exponentiate_acc += 1
  for _ in count_down(tetrate_acc):
      multiply_acc = xmpz(0)
      multiply_acc += 1
      for _ in count_down(exponentiate_acc):
          add_acc = xmpz(0)
          for _ in count_down(multiply_acc):
              for _ in range(a):
                  add_acc += 1
          multiply_acc = add_acc
      exponentiate_acc = multiply_acc
  return exponentiate_acc


a = 2
b = xmpz(3)
print(f"{a}↑↑{b} = ", end="")
c = tetrate(a, b)
print(c)
assert c == 16  # 2^(2^2)
assert b == 0   # Confirm tetrate_acc is consumed

Output:

2↑↑3 = 16

This computes a ↑↑ tetrate_acc, meaning it exponentiates a by itself repeatedly, tetrate_acc times.

For each decrement of tetrate_acc, we exponentiate the current value. We in-line the entire exponentiate and multiply logic again, all the way down to repeated increments.

As expected, this computes 2^(2^2) = 16. With a Google sign-in, you can run this yourself via a Python notebook on Google Colab. Alternatively, you can copy the notebook from GitHub and then run it on your own computer.

We can also run tetrate on 10↑↑15. It will start running, but it won’t stop during our lifetimes — or even the lifetime of the universe:

a = 10
b = xmpz(15)
print(f"{a}↑↑{b} = ", end="")
c = tetrate(a, b)
print(c)

Let’s compare this tetrate function to what we found in the previous Rule Sets.

Rule Set 1: Anything Goes — Infinite Loop

Recall our first function:

while True:
  pass

Unlike this infinite loop, our tetrate function eventually halts — though not anytime soon.

Rule Set 2: Must Halt, Finite Memory — Nested, Fixed-Range Loops

Recall our second function:

for a in range(10**100):
  for b in range(10**100):
      if b % 10_000_000 == 0:
          print(f"{a:,}, {b:,}")

Both this function and our tetrate function contain a fixed number of nested loops. But tetrate differs in an important way: the number of loop iterations grows with the input value. In this function, each loop runs from 0 to 10¹⁰⁰−1 — a hardcoded bound. Tetrate’s loop bounds, by contrast, are dynamic — they grow explosively with each layer of computation.

Rule Sets 3 & 4: Infinite, Zero-Initialized Memory — 5- and 6-State Turing Machines

Compared to the Turing machines, our tetrate function has a clear advantage: we can directly see that it will call += 1 more than 10↑↑15 times. Even better, we can also see — by construction — that it halts.

What the Turing machines offer instead is a simpler, more universal model of computation — and perhaps a more principled definition of what counts as a “small program.”

Conclusion

So, there you have it — a journey through writing absurdly slow programs. Along the way, we explored the outer edges of computation, memory, and performance, using everything from deeply nested loops to Turing machines to a hand-inlined tetration function.

Here’s what surprised me:

  • Nested loops are enough.
    If you just want a short program that halts after outliving the universe, two nested loops with 144 bytes of memory will do the job. I hadn’t realized it was that simple.
  • Turing machines escalate fast.
    The jump from 5 to 6 states unleashes a dramatic leap in complexity and runtime. Also, the importance of starting with zero-initialized memory is obvious in retrospect — but it wasn’t something I’d considered before.
  • Python’s int type can kill performance
    Yes, Python integers are arbitrary precision, which is great. But they’re also immutable. That means every time you do something like x += 1, Python silently allocates a brand-new integer object—copying all the memory of x, no matter how big it is. It feels in-place, but it’s not. This behavior turns efficient-looking code into a performance trap when working with large values. To get around this, we use the gmpy2.xmpz type—a mutable, arbitrary-precision integer that allows true in-place updates.
  • There’s something beyond exponentiation — and it’s called tetration.
    I didn’t know this. I wasn’t familiar with the ↑↑ notation or the idea that exponentiation could itself be iterated to form something even faster-growing. It was surprising to learn how compactly it can express numbers that are otherwise unthinkably large.
    And because I know you’re asking — yes, there’s something beyond tetration too. It’s called pentation, then hexation, and so on. These are part of a whole hierarchy known as hyperoperations. There’s even a metageneralization: systems like the Ackermann function and fast-growing hierarchies capture entire families of these functions and more.
  • Writing Tetration with Explicit Loops Was Eye-Opening
    I already knew that exponentiation is repeated multiplication, and so on. I also knew this could be written recursively. What I hadn’t seen was how cleanly it could be written as nested loops, without copying values and with strict in-place updates.

Thank you for joining me on this journey. I hope you now have a clearer understanding of how small Python programs can run for an astonishingly long time — and what that reveals about computation, memory, and minimal systems. We’ve seen programs that halt only after the universe dies, and others that run even longer.

Please follow Carl on Towards Data Science and on @carlkadie.bsky.social. I write on scientific programming in Python and Rust, machine learning, and statistics. I tend to write about one article per month.

The post How to Optimize your Python Program for Slowness appeared first on Towards Data Science.

How I Would Learn To Code (If I Could Start Over) https://towardsdatascience.com/how-i-would-learn-to-code-if-i-could-start-over/ Fri, 04 Apr 2025 18:43:36 +0000 https://towardsdatascience.com/?p=605424 How to learn to code in 2025

The post How I Would Learn To Code (If I Could Start Over) appeared first on Towards Data Science.

According to various sources, the average salary for Coding jobs is ~£47.5k in the UK, which is ~35% higher than the median salary of about £35k.

So, coding is a very valuable skill that will earn you more money, not to mention it’s really fun.

I have been coding professionally now for 4 years, working as a data scientist and machine learning engineer and in this post, I will explain how I would learn to code if I had to do it all over again.

My journey

I still remember the time I wrote my first bit of code.
It was 9am on the first day of my physics undergrad, and we were in the computer lab.

The professor explained that computation is an integral part of modern physics as it allows us to run large-scale simulations of everything from subatomic particle collisions to the movement of galaxies.

It sounded amazing.

And the way we started this process was by going through a textbook to learn Fortran.

Yes, you heard that right.

My first programming language was Fortran, specifically Fortran 90.
I learned DO loops before FOR loops. I am definitely a rarity in this case.

In that first lab session, I remember writing “Hello World” as is the usual rite of passage and thinking, “Big woop.”

This is how you write “Hello World” in Fortran in case you are interested. 

program hello
print *, 'Hello World!'
end program hello

I actually really struggled to code in Fortran and didn’t do that well on tests we had, which put me off coding.

I still have some old coding projects in Fortran on my GitHub that you can check out.

Looking back, the learning curve to coding is quite steep, but it really does compound, and eventually, it will just click.

I didn’t realise this at the time and actively avoided programming modules in my physics degree, which I regret in hindsight as my progress would have been much quicker.

During my third year, I had to do a research placement as part of my master’s. The company I chose to work for/with used a graphical programming language called LabVIEW to run and manage their experiments.

LabVIEW is based on something called “G” and taught me to think about programming differently from the script-based approach I was used to.

However, I haven’t used it since and probably never will, but it was cool to learn then.

I did enjoy the research year somewhat, but the pace at which research moves, at least in physics, is painfully slow. Nothing like the “heyday” from the early 20th century I envisioned.

One day after work a video was recommended to me on my YouTube home page.

For those of you unaware, this was a documentary about DeepMind’s AI AlphaGo that beat the best GO player in the world. Most people thought that an AI could never be good at GO.

From the video, I started to understand how AI worked and learn about neural networks, reinforcement learning, and deep learning.
I found it all so interesting, similar to physics research in the early 20th century.

Ultimately, this is when I started studying for a career in Data Science and machine learning, where I needed to teach myself Python and SQL.

This is where I so-called “fell in love” with coding.
I saw its real potential in actually solving problems, but the main thing was that I had a motivated reason to learn. I was studying to break into a career I wanted to be in, which really drove me.

I then became a data scientist for three years and am now a Machine Learning engineer. During this time, I worked extensively with Python and SQL.

Until a few months ago, those were the only programming languages I knew. I did learn other tools, such as bash/z-shell, AWS, Docker, Databricks, Snowflake, etc., but not any other “proper” programming languages.

In my spare time, I dabbled a bit with C a couple of years ago, but I have forgotten virtually all of it now. I have some basic scripts on my GitHub if you are interested.

However, in my new role that I started a couple of months ago, I will be using Rust and GO, which I am very much looking forward to learning.

If you are interested in my entire journey to becoming a data scientist and machine learning engineer, you can read about it below:

Choose a language

I always recommend starting with a single language.

According to TestGorilla, there are over 8,000 programming languages, so how do you pick one?

Well, I would argue that many of these are useless for most jobs and have probably been developed as pet projects or for really niche cases.

You could choose your first language based on popularity. The Stack Overflow 2024 survey has great information on this. The most popular languages are JavaScript, Python, SQL, and Java.

However, the way I recommend you choose your first language should be based on what you want to do or work as.

  • Front-end web — JavaScript, HTML, CSS
  • Back-end web — Java, C#, Python, PHP or GO
  • iOS/macOS apps — Swift
  • Android apps — Kotlin or Java
  • Games — C++ or C
  • Embedded Systems — C or C++
  • Data science/machine learning / AI — Python and SQL

As I wanted to work in the AI/ML space, I focused my energy mainly on Python and some on SQL. It was probably a 90% / 10% split as SQL is smaller and easier to learn.

To this day, I still only know Python and SQL to a “professional” standard, but that’s fine, as pretty much the whole machine-learning community requires these languages.

This shows that you don’t need to know many languages; I have progressed quite far in my career, only knowing two to a significant depth. Of course, it would vary by sector, but the main point still stands.

So, pick a field you want to enter and choose the most in-demand and relevant language in that field.

Learn the bare minimum

The biggest mistake I see beginners make is getting stuck in “tutorial hell.”

This is where you take course after course but never branch out on your own.

I recommend taking a maximum of two courses on a language — literally any intro course would do — and then starting to build immediately.

And I literally mean, build your own projects and experience the struggle because that’s where learning is done.

You won’t know how to write functions until you do it yourself, you won’t know how to create classes until you do it yourself, and you literally won’t understand loops until you implement them yourself.

So, learn the bare minimum and immediately start experimenting; I promise it will at least 2x your learning curve.

You probably have heard this advice a lot, but in reality it is that simple. 

I always say that most things in life are simple but hard to do, especially in programming.

Avoid trends

When I say avoid trends, I don’t mean not to focus on areas that are doing well or in demand in the market.

What I am saying is that when you pick a certain language or specialism, stick with it.

Programming languages all share similar concepts and patterns, so when you learn one, you indirectly improve your ability to pick up another later.

But you still should focus on one language for at least a few months.

Don’t develop “shiny object syndrome” and chase the latest technologies; it’s a game that you will unfortunately lose.

There have been so many “distracting” technologies, such as blockchain, Web3, AI, the list goes on.

Instead, focus on the fundamentals:

  • Data types
  • Design patterns
  • Object-oriented programming
  • Data structures and algorithms
  • Problem-solving skills

These topics transcend individual programming languages and are much better to master than the latest Javascript framework!

It’s much better to have a strong understanding of one area than try to learn everything. Not only is this more manageable, but it is also better for your long-term career.

As I said earlier, I have progressed quite well in my career by only knowing Python and SQL, as I learned the required technologies for the field and didn’t get distracted.

I can’t stress how much leverage you will have in your career if you document your learning publicly.

Document your learning

I don’t know why more people don’t do this. Sharing what I have learned online has been the biggest game changer for my career.

Literally committing your code on GitHub is enough, but I really recommend posting on LinkedIn or X, and ideally, you should create blog posts to help you cement your understanding and show off your knowledge to employers.

When I interview candidates, if they have some sort of online presence showing their learnings, that’s immediately a tick in my box and an extra edge over other applicants.

It shows enthusiasm and passion, not to mention increasing your surface area of serendipity. 

I know many people are scared to do this, but you are suffering from the spotlight effect. Wikipedia defines this as:

The spotlight effect is the psychological phenomenon by which people tend to believe they are being noticed more than they really are.

No one really cares whether you post online, or thinks about you even 1% as much as you think they do.

So, start posting.

What about AI?

I could spend hours discussing why AI is not an immediate risk for anyone who wants to work in the coding profession.

You should embrace AI as part of your toolkit, but that’s as far as it will go, and it will definitely not replace programmers in 5 years.

Unless an AGI breakthrough suddenly occurs in the next decade, which is highly unlikely.

I personally doubt the answer to AGI is the cross-entropy loss function, which is what is used in most LLMs nowadays.

It has been shown time and time again that these AI models lack strong mathematical reasoning abilities, which is one of the most fundamental skills to being a good coder.

Even the so-called “software engineer killer” Devin is not as good as the creators initially marketed it. 

Most companies are simply trying to boost their investment by hyping AI, and their results are often over-exaggerated with controversial benchmark testing.

When I was building a website, ChatGPT even struggled with simple HTML and CSS, which you can argue is its bread and butter!

Overall, don’t worry about AI if you want to work as a coder; there are much, much bigger fish to fry before we cross that bridge!

NeetCode has done a great video explaining how current AI is incapable of replacing programmers.

Another thing!

Join my free newsletter, Dishing the Data, where I share weekly tips, insights, and advice from my experience as a practicing data scientist. Plus, as a subscriber, you’ll get my FREE Data Science Resume Template!

Connect with me

The post How I Would Learn To Code (If I Could Start Over) appeared first on Towards Data Science.

Creating an AI Agent to Write Blog Posts with CrewAI https://towardsdatascience.com/creating-an-ai-agent-to-write-blog-posts-with-crewai/ Fri, 04 Apr 2025 18:11:49 +0000 https://towardsdatascience.com/?p=605422 How to assemble a crew of AI agents with CrewAI and Python

The post Creating an AI Agent to Write Blog Posts with CrewAI appeared first on Towards Data Science.

Introduction

I love writing. You may notice that if you follow me or my blog. For that reason, I am constantly producing new content and talking about Data Science and Artificial Intelligence.

I discovered this passion a couple of years ago when I was just starting my path in Data Science, learning and evolving my skills. At that time, I heard some more experienced professionals in the area saying that a good study technique was practicing new skills and writing about it somewhere, teaching whatever you learned.

In addition, I had just moved to the US, and nobody knew me here. So I had to start somewhere, creating my professional image in this competitive market. I remember I talked to my cousin, who’s also in the Tech industry, and he told me: write blog posts about your experiences. Tell people what you are doing. And so I did. 

And I never stopped.

Fast forward to 2025, now I have almost two hundred published articles, many of them with Towards Data Science, a published Book, and a good audience. 

Writing helped me so much in the Data Science area.

Most recently, one of my interests has been the amazing Natural Language Processing and Large Language Models subjects. Learning about how these modern models work is fascinating. 

That interest led me to experiment with Agentic AI as well. So, I learned about CrewAI, an easy-to-use, open-source package that helps us build AI agents in a fun and simple way, with little code. I decided to test it by creating a crew of agents to write a blog post, and then see how that goes.

In this post, we will learn how to create those agents and make them work together to produce a simple blog post.

Let’s do that.

What is a Crew?

A crew of AI agents is a combination of two or more agents, each of them performing a task towards a final goal.

In this case study, we will create a crew that will work together to produce a small blog post about a given topic that we will provide.

Crew of Agents workflow. Image by the author

The flow works like this:

  1. We choose a given topic for the agents to write about.
  2. Once the crew is started, it will go to the knowledge base, read some of my previously written articles, and try to mimic my writing style. Then, it generates a set of guidelines and passes it to the next agent.
  3. Next, the Planner agent takes over and searches the Internet looking for good content about the topic. It creates a plan of content and sends it to the next agent.
  4. The Writer agent receives the writing plan and executes it according to the context and information received.
  5. Finally, the content is passed to the last agent, the Editor, who reviews the content and returns the final document as the output.

In the following section, we will see how this can be created.

Code

CrewAI is a great Python package because it simplifies the code for us. So, let’s begin by installing the two needed packages.

pip install crewai crewai-tools

Next, if you want, you can follow the instructions on their Quickstart page and have a full project structure created for you with just a couple of commands on a terminal. Basically, it will install some dependencies, generate the folder structure suggested for CrewAI projects, as well as generate some .yaml and .py files. 

I personally prefer to create those myself, but it is up to you. The page is listed in the References section.

Folder Structure

So, here we go.

We will create these folders:

  • knowledge
  • config

And these files:

  • In the config folder: create the files agents.yaml and tasks.yaml
  • In the knowledge folder, that’s where I will add the files with my writing style.
  • In the project root: create crew.py and main.py.
Folders structure. Image by the author.

Make sure to create the folders with the names mentioned, as CrewAI looks for agents and tasks inside the config folder and for the knowledge base within a knowledge folder.

Next, let us set our agents. 

Agents

The agents are composed of:

  • Name of the agent: writer_style
  • Role: LLMs are good role players, so here you can tell them which role to play.
  • Goal: tell the model what the goal of that agent is.
  • Backstory: Describe the story behind this agent, who it is, what it does. 
writer_style:
  role: >
    Writing Style Analyst
  goal: >
    Thoroughly read the knowledge base and learn the characteristics of the crew, 
    such as tone, style, vocabulary, mood, and grammar.
  backstory: >
    You are an experienced ghost writer who can mimic any writing style.
    You know how to identify the tone and style of the original writer and mimic 
    their writing style.
    Your work is the basis for the Content Writer to write an article on this topic.

I won’t bore you with all the agents created for this crew. I believe you got the idea. It is a set of prompts explaining to each agent what they are going to do. All the agents’ instructions are stored in the agents.yaml file.

Think of it as if you were a manager hiring people to create a team. Think about what kinds of professionals you would need, and what skills are needed.

We need 4 professionals who will work towards the final goal of producing written content: (1) a Writer Stylist, (2) a Planner, (3) a Writer, and (4) an Editor

If you want to see the setup for them, just check the full code in the GitHub repository.

Tasks

Now, back to the analogy of the manager hiring people: once we “hired” our entire crew, it is time to divide up the tasks. We know that we want to produce a blog post and that we have 4 agents, but what will each of them do?

Well, that will be configured in the file tasks.yaml.

To illustrate, let me show you the code for the Writer agent. Once again, these are the parts needed for the prompt:

  • Name of the task: write
  • Description: The description is like telling the professional how you want that task to be performed, just like we would tell a new hire how to perform their new job. Give precise instructions to get the best result possible.
  • Expected output: This is how we want to see the output. Notice that I give instructions like the size of the blog post, the quantity of paragraphs, and other information that helps my agent to give me the expected output. 
  • Agent to perform it: Here, we are indicating the agent who will perform this task, using the same name set in the agents.yaml file.
  • Output file: Not always applicable, but when it is, this is the argument to use. We asked for a markdown file as output.
write:
  description: >
    1. Use the content plan to craft a compelling blog post on {topic}.
    2. Incorporate SEO keywords naturally.
    3. Sections/Subtitles are properly named in an engaging manner. Make sure 
    to add Introduction, Problem Statement, Code, Before You Go, References.
    4. Add a summarizing conclusion - This is the "Before You Go" section.
    5. Proofread for grammatical errors and alignment with the writer's style.
    6. Use analogies to make the article more engaging and complex concepts easier
    to understand.
  expected_output: >
    A well-written blog post in markdown format, ready for publication.
    The article must be within a 7 to 12 minutes read.
    Each section must have at least 3 paragraphs.
    When writing code, you will write a snippet of code and explain what it does. 
    Be careful to not add a huge snippet at a time. Break it in reasonable chunks.
    In the examples, create a sample dataset for the code.
    In the Before You Go section, you will write a conclusion that is engaging
    and factually accurate.
  agent: content_writer
  output_file: blog_post.md

After the agents and tasks are defined, it is time to create our crew flow.

Coding the Crew

Now we will create the file crew.py, where we will translate the previously presented flow to Python code.

We begin by importing the needed modules.

#Imports
import os
from crewai import Agent, Task, Process, Crew, LLM
from crewai.project import CrewBase, agent, crew, task
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai_tools import SerperDevTool

We will use the basic Agent, Task, Crew, Process and LLM to create our flow. PDFKnowledgeSource will help the first agent learn my writing style, and SerperDevTool is the tool to search the internet. For that one, make sure to get your API key at https://serper.dev/signup.

A best practice in software development is to keep your API keys and configuration settings separate from your code. We’ll use a .env file for this, providing a secure place to store these values. Here’s the command to load them into our environment.

from dotenv import load_dotenv
load_dotenv()
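For reference, the .env file might look like the sketch below. OPENAI_API_KEY and GROQ_API_KEY match the variable names used later in this section; SERPER_API_KEY is my assumption of the name SerperDevTool expects, so double-check it against the crewai-tools documentation.

# .env - keep this file out of version control
OPENAI_API_KEY=your-openai-key-here
GROQ_API_KEY=your-groq-key-here
SERPER_API_KEY=your-serper-key-here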

Then, we will use the PDFKnowledgeSource to show the Crew where to search for the writer’s style. By default, that tool looks at the knowledge folder of your project, thus the importance of the name being the same.

# Knowledge sources

pdfs = PDFKnowledgeSource(
    file_paths=['article1.pdf',
                'article2.pdf',
                'article3.pdf'
                ]
)

Now we can set up the LLM we want to use for the Crew. It can be any model. I tested a bunch of them, and the ones I liked the most were qwen-qwq-32b and gpt-4o. If you choose OpenAI’s model, you will need an OpenAI API key. For Qwen-QWQ, you need an API key from Groq; just uncomment the Groq lines and comment out the OpenAI lines.

# LLMs

llm = LLM(
    # model="groq/qwen-qwq-32b",
    # api_key= os.environ.get("GROQ_API_KEY"),
    model= "gpt-4o",
    api_key= os.environ.get("OPENAI_API_KEY"),
    temperature=0.4
)

Now we have to create a Crew Base, showing where CrewAI can find the agents and tasks configuration files.

# Creating the crew: base shows where the agents and tasks are defined

@CrewBase
class BlogWriter():
    """Crew to write a blog post"""
    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

Agents Functions

And we are ready to create the code for each agent. Each one is composed of a decorator @agent to show that the following function is an agent. We then use the class Agent and indicate the name of the agent in the config file, as well as the level of verbosity: 1 is low, 2 is high. You can also use a Boolean value, such as True or False.

Lastly, we specify if the agent uses any tool, and what model it will use.

# Configuring the agents
    @agent
    def writer_style(self) -> Agent:
        return Agent(
                config=self.agents_config['writer_style'],
                verbose=1,
                knowledge_sources=[pdfs]
                )

    @agent
    def planner(self) -> Agent:
        return Agent(
        config=self.agents_config['planner'],
        verbose=True,
        tools=[SerperDevTool()],
        llm=llm
        )

    @agent
    def content_writer(self) -> Agent:
        return Agent(
        config=self.agents_config['content_writer'],
        verbose=1
        )

    @agent
    def editor(self) -> Agent:
        return Agent(
        config=self.agents_config['editor'],
        verbose=1
        )

Tasks Functions

The next step is creating the tasks. Similarly to the agents, we will create a function and decorate it with @task. We use the class Task to inherit CrewAI’s functionalities and then point to the corresponding entry in our tasks.yaml file for each task created. If any output file is expected, use the output_file argument.

# Configuring the tasks    

    @task
    def style(self) -> Task:
        return Task(
        config=self.tasks_config['mystyle'],
        )

    @task
    def plan(self) -> Task:
        return Task(
        config=self.tasks_config['plan'],
        )

    @task
    def write(self) -> Task:
        return Task(
        config=self.tasks_config['write'],
        output_file='output/blog_post.md' # This is the file that will contain the final blog post.
        )

    @task
    def edit(self) -> Task:
        return Task(
        config=self.tasks_config['edit']
        )

Crew

To glue everything together, we now create a function and decorate it with the @crew decorator. That function will line up the agents and the tasks in the order to be performed, since the process chosen here is the simplest: sequential. In other words, everything runs in sequence, from start to finish.

    @crew
    def crew(self) -> Crew:
        """Creates the Blog Post crew"""

        return Crew(
            agents= [self.writer_style(), self.planner(), self.content_writer(), self.editor()],
            tasks= [self.style(), self.plan(), self.write(), self.edit()],
            process=Process.sequential,
            verbose=True
        )

Running the Crew

Running the crew is very simple. We create the main.py file and import the BlogWriter crew base we created. Then we just call crew().kickoff(inputs) to run it, passing a dictionary with the inputs to be used to generate the blog post.

# Script to run the blog writer project

# Warning control
import warnings
warnings.filterwarnings('ignore')
from crew import BlogWriter


def write_blog_post(topic: str):
    # Instantiate the crew
    my_writer = BlogWriter()
    # Run
    result = (my_writer
              .crew()
              .kickoff(inputs = {
                  'topic': topic
                  })
    )

    return result

if __name__ == "__main__":

    write_blog_post("Price Optimization with Python")

There it is. The result is a nice blog post created by the LLM. See below.

Resulting blog post. GIF by the author.

That is so nice!

Before You Go

Before you go, know that this blog post was 100% created by me. This crew I created was an experiment I wanted to do to learn more about how to create AI agents and make them work together. And, like I said, I love writing, so this is something I would be able to read and assess the quality.

My opinion is that this crew still didn’t do a very good job. They were able to complete the tasks successfully, but they gave me a very shallow post and code. I would not publish this, but at least it could be a start, maybe. 

From here, I encourage you to learn more about CrewAI. I took their free course where João de Moura, the creator of the package, shows us how to create different kinds of crews. It is really interesting.

GitHub Repository

https://github.com/gurezende/Crew_Writer

About Me

If you want to learn more about my work, or follow my blog (it is really me!), here are my contacts and portfolio.

https://gustavorsantos.me

References

[Quickstart CrewAI](https://docs.crewai.com/quickstart)

[CrewAI Documentation](https://docs.crewai.com/introduction)

[GROQ](https://groq.com/)

[OpenAI](https://openai.com)

[CrewAI Free Course](https://learn.crewai.com/)

The post Creating an AI Agent to Write Blog Posts with CrewAI appeared first on Towards Data Science.

A Simple Implementation of the Attention Mechanism from Scratch https://towardsdatascience.com/a-simple-implementation-of-the-attention-mechanism-from-scratch/ Tue, 01 Apr 2025 01:05:51 +0000 https://towardsdatascience.com/?p=605368 How attention helped models like RNNs mitigate the vanishing gradient problem and capture long-range dependencies among words

The post A Simple Implementation of the Attention Mechanism from Scratch appeared first on Towards Data Science.

Introduction

The Attention Mechanism is often associated with the transformer architecture, but it was already used in RNNs. In Machine Translation or MT (e.g., English-Italian) tasks, when you want to predict the next Italian word, you need your model to focus, or pay attention, on the most important English words that are useful to make a good translation.

Attention in RNNs

I will not go into details of RNNs, but attention helped these models to mitigate the vanishing gradient problem and to capture more long-range dependencies among words.

At a certain point, we understood that the only important thing was the attention mechanism, and the entire RNN architecture was overkill. Hence, Attention is All You Need!

Self-Attention in Transformers

Classical attention indicates where words in the output sequence should focus attention in relation to the words in the input sequence. This is important in sequence-to-sequence tasks like MT.

Self-attention is a specific type of attention. It operates between any two elements in the same sequence. It provides information on how “correlated” the words are within the same sentence.

For a given token (or word) in a sequence, self-attention generates a list of attention weights corresponding to all other tokens in the sequence. This process is applied to each token in the sentence, obtaining a matrix of attention weights (as in the picture).

This is the general idea. In practice, things are a bit more complicated because we want to add many learnable parameters to our neural network. Let’s see how.

K, V, Q representations

Our model input is a sentence like “my name is Marcello Politi”. With the process of tokenization, a sentence is converted into a list of numbers like [2, 6, 8, 3, 1].

Before feeding the sentence into the transformer we need to create a dense representation for each token.

How to create this representation? We multiply each token by a matrix. The matrix is learned during training.

Let’s add some complexity now.

For each token, we create 3 vectors instead of one. We call these vectors key, value and query. (We will see later how we create these 3 vectors.)

Conceptually, these 3 vectors have a particular meaning:

  • The vector key represents the core information captured by the token
  • The vector value captures the full information of a token
  • The vector query represents a question about the token’s relevance for the current task

So the idea is that we focus on a particular token i and ask how important the other tokens in the sentence are for understanding it.

This means that we take the vector q_i of token i (we ask a question regarding i) and perform some mathematical operations with the key vectors k_j of all the other tokens (j != i). This is like asking, at first glance, which other tokens in the sequence look really important for understanding the meaning of token i.

What is this magical mathematical operation?

We need to multiply (dot-product) the query vector by the key vectors and divide by a scaling factor. We do this for each k_j token.

In this way, we obtain a score for each pair (q_i, k_j). We turn this list of scores into a probability distribution by applying a softmax operation to it. Great, now we have obtained the attention weights!

With the attention weights, we know how important each token k_j is for understanding token i. We now multiply the value vector v_j associated with each token by its weight and sum the vectors. In this way, we obtain the final context-aware vector of token_i.

If we are computing the contextual dense vector of token_1 we calculate:

z1 = a11*v1 + a12*v2 + … + a15*v5

Where a1j are the computed attention weights, and v_j are the value vectors.

Done! Almost…

I didn’t cover how we obtained the vectors k, v and q of each token. We need to define some matrices w_k, w_v and w_q so that when we multiply:

  • token * w_k -> k
  • token * w_q -> q
  • token * w_v -> v

These 3 matrices are initialized at random and learned during training. This is why we have so many parameters in modern models such as LLMs.

Multi-head Self-Attention in Transformers (MHSA)

Are we sure that the previous self-attention mechanism is able to capture all important relationships among tokens (words) and create dense vectors of those tokens that really make sense?

It might not always work perfectly. What if, to mitigate the error, we re-ran the entire thing 2 times with new w_q, w_k and w_v matrices and somehow merged the 2 dense vectors obtained? In this way, maybe one self-attention managed to capture some relationships and the other managed to capture others.

Well, this is exactly what happens in MHSA. The case we just discussed contains two heads because it has two sets of w_q, w_k and w_v matrices. We can have even more heads: 4, 8, 16, etc.

The only complicated thing is that all these heads are managed in parallel: we process them all in the same computation using tensors.

The way we merge the dense vectors of each head is simple: we concatenate them (hence the dimension of each vector must be smaller, so that when we concatenate them we obtain the original dimension we wanted), and we pass the resulting vector through another learnable matrix, w_o.

Hands-on

import torch

Suppose you have a sentence. After tokenization, each token (word for simplicity) corresponds to an index (number):

tokenized_sentence = torch.tensor([
    2, #my
    6, #name
    8, #is
    3, #marcello
    1  #politi
])
tokenized_sentence

Before feeding the sentence into the transformer, we need to create a dense representation for each token.

How do we create this representation? We multiply each token by a matrix. This matrix is learned during training.

Let’s build this embedding matrix.

torch.manual_seed(0) # set a fixed seed for reproducibility
embed = torch.nn.Embedding(10, 16)

If we pass our tokenized sentence through the embedding layer, we obtain a dense representation of dimension 16 for each token.

sentence_embed = embed(tokenized_sentence).detach()
sentence_embed

In order to use the attention mechanism, we need to create 3 new vectors per token. We define 3 matrices w_q, w_k and w_v. When we multiply one input token by w_q we obtain the vector q. Same with w_k and w_v.

d = sentence_embed.shape[1] # let's base our matrix on a shape (16,16)

w_key = torch.rand(d,d)
w_query = torch.rand(d,d)
w_value = torch.rand(d,d)

Compute attention weights

Let’s now compute the attention weights for only the first input token of the sentence.

token1_embed = sentence_embed[0]

# compute the three vectors associated with token1: q, k, v
key_1 = w_key.matmul(token1_embed)
query_1 = w_query.matmul(token1_embed)
value_1 = w_value.matmul(token1_embed)

print("key vector for token1: \n", key_1)   
print("query vector for token1: \n", query_1)
print("value vector for token1: \n", value_1)

We need to multiply the query vector associated with token1 (query_1) with the key vectors of all the other tokens.

So now we need to compute all the keys (key_2, key_3, key_4, key_5). But wait, we can compute all of these in one shot by multiplying sentence_embed by the w_key matrix.

keys = sentence_embed.matmul(w_key.T)
keys[0] #contains the key vector of the first token and so on

Let’s do the same thing with the values

values = sentence_embed.matmul(w_value.T)
values[0] #contains the value vector of the first token and so on

Let’s compute the first part of the attention formula.

import torch.nn.functional as F
# the following are the attention weights of the first token to all the others
a1 = F.softmax(query_1.matmul(keys.T)/d**0.5, dim = 0)
a1

With the attention weights, we know how important each token is. So now we multiply the value vector associated with each token by its weight.

Summing these weighted vectors gives the final context-aware vector of token_1.

z1 = a1.matmul(values)
z1

In the same way, we could compute the context-aware dense vectors of all the other tokens. Note that we are always using the same matrices w_k, w_q, w_v: we say that we use one head.

But we can have multiple triplets of matrices, so multi-head. That’s why it is called multi-head attention.

The dense vectors that each head outputs for an input token are concatenated at the end and linearly transformed to get the final dense vector.

Implementing Multi-head Self-Attention

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0) # fixed seed for reproducibility

Same steps as before…

# Tokenized sentence (same as yours)
tokenized_sentence = torch.tensor([2, 6, 8, 3, 1])  # [my, name, is, marcello, politi]

# Embedding layer: vocab size = 10, embedding dim = 16
embed = nn.Embedding(10, 16)
sentence_embed = embed(tokenized_sentence).detach()  # Shape: [5, 16] (seq_len, embed_dim)

We’ll define a multi-head attention mechanism with h heads (let’s say 4 heads for this example). Each head will have its own w_q, w_k, and w_v matrices, and the output of each head will be concatenated and passed through a final linear layer.

Since the outputs of the heads will be concatenated, and we want a final dimension of d, the dimension of each head needs to be d/h. Additionally, each concatenated vector will go through a linear transformation, so we need another matrix w_output, as you can see in the formula.

d = sentence_embed.shape[1]  # embed dimension 16
h = 4  # Number of heads
d_k = d // h  # Dimension per head (16 / 4 = 4)

Since we have 4 heads, we want 4 copies of each matrix. Instead of separate copies, we add a dimension, which is the same thing but lets us do everything in a single operation. (Imagine stacking the matrices on top of each other: it is the same thing.)

# Define weight matrices for each head
w_query = torch.rand(h, d, d_k)  # Shape: [4, 16, 4] (one d x d_k matrix per head)
w_key = torch.rand(h, d, d_k)    # Shape: [4, 16, 4]
w_value = torch.rand(h, d, d_k)  # Shape: [4, 16, 4]
w_output = torch.rand(d, d)  # Final linear layer: [16, 16]

For simplicity, I’m using torch’s einsum. If you’re not familiar with it, check out my blog post.

The einsum operation torch.einsum('sd,hde->hse', sentence_embed, w_query) in PyTorch uses letters to define how to multiply and rearrange numbers. Here’s what each part means:

  1. Input Tensors:
    • sentence_embed with the notation 'sd':
      • s represents the number of words (sequence length), which is 5.
      • d represents the number of numbers per word (embedding size), which is 16.
      • The shape of this tensor is [5, 16].
    • w_query with the notation 'hde':
      • h represents the number of heads, which is 4.
      • d represents the embedding size, which again is 16.
      • e represents the new number size per head (d_k), which is 4.
      • The shape of this tensor is [4, 16, 4].
  2. Output Tensor:
    • The output has the notation 'hse':
      • h represents 4 heads.
      • s represents 5 words.
      • e represents 4 numbers per head.
      • The shape of the output tensor is [4, 5, 4].

# Compute Q, K, V for all tokens and all heads
# sentence_embed: [5, 16] -> Q: [4, 5, 4] (h, seq_len, d_k)
queries = torch.einsum('sd,hde->hse', sentence_embed, w_query)  # h heads, seq_len tokens, d dim
keys = torch.einsum('sd,hde->hse', sentence_embed, w_key)       # h heads, seq_len tokens, d dim
values = torch.einsum('sd,hde->hse', sentence_embed, w_value)   # h heads, seq_len tokens, d dim

This einsum equation performs a dot product between the queries (hse) and the transposed keys (hek) to obtain scores of shape [h, seq_len, seq_len], where:

  • h -> Number of heads.
  • s and k -> Sequence length (number of tokens).
  • e -> Dimension of each head (d_k).

The division by (d_k ** 0.5) scales the scores to stabilize gradients. Softmax is then applied to obtain attention weights:

# Compute attention scores
scores = torch.einsum('hse,hek->hsk', queries, keys.transpose(-2, -1)) / (d_k ** 0.5)  # [4, 5, 5]
attention_weights = F.softmax(scores, dim=-1)  # [4, 5, 5]
# Apply attention weights
head_outputs = torch.einsum('hij,hjk->hik', attention_weights, values)  # [4, 5, 4]
head_outputs.shape

Now we concatenate the outputs of all the heads for each token

# Concatenate heads
concat_heads = head_outputs.permute(1, 0, 2).reshape(sentence_embed.shape[0], -1)  # [5, 16]
concat_heads.shape

Let’s finally multiply by the last w_output matrix, as in the formula above

multihead_output = concat_heads.matmul(w_output)  # [5, 16] @ [16, 16] -> [5, 16]
print("Multi-head attention output for token1:\n", multihead_output[0])

Final Thoughts

In this blog post I’ve implemented a simple version of the attention mechanism. This is not how it is really implemented in modern frameworks, but my goal is to provide some insights that allow anyone to understand how it works. In future articles, I’ll go through the entire implementation of a transformer architecture.
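For comparison, here is a quick sketch (a minimal usage example, not part of the from-scratch implementation above) showing how PyTorch packages the same ideas of per-head projections, scaled dot-product and output projection into the built-in nn.MultiheadAttention module:

import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

x = torch.rand(1, 5, 16)          # (batch, seq_len, embed_dim), like our 5-token sentence
out, attn_weights = mha(x, x, x)  # self-attention: query = key = value

print(out.shape)           # torch.Size([1, 5, 16])
print(attn_weights.shape)  # torch.Size([1, 5, 5]), averaged over the heads by default

The module learns the per-head projection matrices and the output projection for you, which is exactly the role played by w_query, w_key, w_value and w_output in the manual version.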

Follow me on TDS if you like this article! 😁

💼 Linkedin | 🐦 X (Twitter) | 💻 Website


Unless otherwise noted, images are by the author

The post A Simple Implementation of the Attention Mechanism from Scratch appeared first on Towards Data Science.

]]>
Create Your Supply Chain Analytics Portfolio to Land Your Dream Job https://towardsdatascience.com/create-your-supply-chain-analytics-portfolio-to-land-your-dream-job/ Tue, 01 Apr 2025 00:46:33 +0000 https://towardsdatascience.com/?p=605366 A guide for students and professionals to build real-world projects and showcase their skills using the Supply Chain Analytics Cheat Sheet.

The post Create Your Supply Chain Analytics Portfolio to Land Your Dream Job appeared first on Towards Data Science.

]]>
Supply chains are under pressure like never before.

From climate-driven disruptions to geopolitical shifts, businesses must adapt to rising costs, new trade barriers and growing sustainability demands.

In this new world where supply chains face uncertainty, Supply Chain Analytics is essential to keep resilient operations.

Samir, can you advise me on how to build a supply chain analytics portfolio with actual projects?

Since publishing my first post on Towards Data Science on August 5th, 2020, I’ve frequently received this question from readers on LinkedIn or YouTube.

In this article, I’ll share my perspective—after nine years in the industry—on how I would use my Supply Chain Analytics Cheat Sheet to build a portfolio if I were starting out as a junior data scientist.

The Supply Chain Analytics Cheat Sheet

What is Supply Chain Analytics?

Let’s start by defining the terms we use.

Supply Chain Analytics refers to a set of tools and methodologies used to extract insights from data across all processes in the value chain.

Systems of a Supply Chain — (Image by Samir Saci)

For instance, a retail company may use:

  • An ERP to manage procurement, finance and sales
  • A Warehouse Management System to manage its distribution centres
  • A Transport Management System to manage inbound and outbound freight

4 types of Supply Chain Analytics to Answer Questions — (Image by Author)

As a Supply Chain Solution Manager and Data Scientist in the logistics industry, I have used analytics in international projects to design and optimize supply chain solutions.

I’ve shared many of these methodologies and tools in over 75 articles published on Towards Data Science.

I’ve compiled them into this concise and comprehensive Supply Chain Analytics Cheat Sheet.

Screenshot of the Supply Chain Analytics Cheat Sheet including 70+ case studies: Link – (Image by Samir Saci)

What’s Inside the Supply Chain Analytics Cheat Sheet?

Whether you want to reduce distribution costs, minimize your company’s environmental impact, or maximize profitability, you’ll find the answers to your questions here.

How Can Analytics Help Improve Profitability?

Data Analytics to Boost Business Profitability

The first section of the cheat sheet is about data analytics for Business Strategy.

Data Analytics for Business Strategy — (Image by Samir Saci)

It includes practical case studies on how to use data to support business executives in their strategic decision-making.

For instance, the series of articles, “Business Planning with Python”, is based on a real example of a business managed by my friend.

Example used in the Business Planning with Python articles — (Image by Samir Saci)

“We have to refuse orders as we don’t have enough cash to pay suppliers for stock replenishment.”

I built a simulation model based on this insight to help him understand weaknesses in his value chain and uncover growth opportunities.

They illustrate how you can add value to small, medium, and large business owners.

What About Optimizing Supply Chain Operations?

Supply Chain Analytics for Logistics Operations

Having spent years designing, monitoring, and optimizing supply chain solutions, I’ve compiled many case studies focused on warehousing and transportation operations.

Section of Logistics Operations Optimization — (Image by Samir Saci)

In this section, most case studies are based on actual reengineering projects I have conducted in Asia or Europe.

Country Manager: “Samir, we need to reduce warehousing costs by 15% if we want to renew the contract with the retail company XXX.”

They focus on optimizing a specific process in a warehouse (order preparation, value-added services) or transportation operations (routing, scheduling).

Go to the nearest warehouse and ask: ‘What are your problems?’ You can be sure they will find some for you.

Here’s how to get started:

  1. Review the case studies to understand the problem statement and the solution.
  2. Pull the source code from my GitHub repository.
  3. Search for a similar problem in your company
  4. Adapt the code to build a solution to your specific problem

The code is usually a simple Python script or a Jupyter Notebook that can be easily adapted.

What if you want to have a greater impact? Focus on a flow optimization.

Data Analytics for Supply Chain Optimization

The main driver of the reengineering projects I have conducted was cost.

Examples of Business Indicators along the value chain — (Image by Samir Saci)

Usually, customers tracked logistics costs, i.e. the percentage of turnover spent on logistic operations.

Therefore, we needed to find solutions (as a third-party logistic service provider) to reduce this percentage without impacting our profitability.

What if we delivered to the U.S. East Coast from a warehouse in Charlotte?

The solutions presented in the previous section are too localized. We need to take a step back and consider flow optimization.

Examples of Supply Chain Optimization — (Image by Author)

These case studies focus on the optimization of goods flow using

  • Replenishment rules and forecasting algorithms to optimize inventory
  • Linear/Non-Linear programming to match the supply with demand at the lowest cost (a minimal sketch follows this list)
  • Statistical tools for diagnostic and improvement of specific processes
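To make the second bullet concrete, here is a minimal, self-contained sketch of matching supply with demand at the lowest cost with linear programming. The plants, warehouses, costs and capacities below are made-up numbers for illustration only; the actual case studies use richer models.

import numpy as np
from scipy.optimize import linprog

# Transportation cost per unit from 2 plants to 3 warehouses (illustrative numbers)
costs = np.array([
    [4.0, 6.0, 9.0],
    [5.0, 3.0, 7.0],
])
supply = np.array([500, 400])       # plant capacities
demand = np.array([300, 350, 250])  # warehouse demands

n_plants, n_wh = costs.shape
c = costs.ravel()  # one decision variable per plant-to-warehouse lane

# Each plant ships at most its capacity
A_ub = np.zeros((n_plants, n_plants * n_wh))
for i in range(n_plants):
    A_ub[i, i * n_wh:(i + 1) * n_wh] = 1

# Each warehouse receives exactly its demand
A_eq = np.zeros((n_wh, n_plants * n_wh))
for j in range(n_wh):
    A_eq[j, j::n_wh] = 1

res = linprog(c, A_ub=A_ub, b_ub=supply, A_eq=A_eq, b_eq=demand, bounds=(0, None))
print("Total transportation cost:", res.fun)
print("Shipping plan (plants x warehouses):\n", res.x.reshape(n_plants, n_wh))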

For some case studies, I have deployed the models in a web application developed for my startup, LogiGreen.

LogiGreen App publicly available — (Image by Samir Saci)

The demo version is publicly available for you to test the models; more information here.

What about sustainability?

If you want to support the green transformation of your company, I have some examples for you.

Supply Chain Analytics for Sustainability

Ever since my first project focused on sustainability, I have been convinced that green transformation is similar to supply chain optimization.

Sustainable Supply Chain Optimization — (Image by Samir Saci)

Therefore, you can find 17 examples of optimization solutions using this approach to minimize CO2 emissions or resource usage.

Supply Chain Sustainability Section of the Cheat Sheet — (Image by Samir Saci)

I also decided to cover the reporting side of sustainability with analytics for Life Cycle Assessment, CO2 emissions calculations or ESG reporting.

If you need support getting started with sustainability projects, you’re covered.

Now It’s Your Turn to Build!

If you need a more detailed presentation of the cheat sheet, check out this short YouTube tutorial.

In the next section, I’ll share how I would approach building a portfolio if I were a junior engineer—or someone transitioning into analytics—looking to land a job or freelance project.

Starting your Analytics Portfolio

Let’s assume I’m a junior data scientist aiming to join a major retailer’s Supply Chain Analytics team.

I want to start a project to showcase how I can use my skills to help retailers improve their service and reduce costs.

Advice 1: Start with a Simple Project

For most companies, data maturity in supply chain departments is very low.

That means the implementation of advanced (and complex) algorithms can be very challenging.

Therefore, I would focus on:

  • Delivering business value (visibility, insights, diagnostics)
  • Smooth user experience of your product or analysis

Therefore, I would pick the topic of ABC Analysis and Product Segmentation.

My Article titled Product Segmentation for Retail with Python – (Image by Author)

This article provides multiple examples of analysis to segment products based on their demand variability and contribution to the turnover.

Pareto Chart generated with Python – (Image by Samir Saci)

The article includes a link to a GitHub repository with a Jupyter Notebook containing all the necessary code.
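If you want a feel for the core logic before opening the notebook, here is a minimal ABC segmentation sketch in pandas. The file name and the "SKU" and "TURNOVER" columns are hypothetical placeholders; the notebook linked above works on its own dataset and also covers demand variability.

import pandas as pd

df = pd.read_csv("sales_by_sku.csv")  # hypothetical file with one row per SKU

# Rank SKUs by contribution to turnover and compute the cumulative share
df = df.sort_values("TURNOVER", ascending=False).reset_index(drop=True)
df["TURNOVER_SHARE"] = df["TURNOVER"] / df["TURNOVER"].sum()
df["CUMULATIVE_SHARE"] = df["TURNOVER_SHARE"].cumsum()

def abc_class(cum_share):
    # Classic split: A covers ~80% of turnover, B the next 15%, C the rest
    if cum_share <= 0.80:
        return "A"
    elif cum_share <= 0.95:
        return "B"
    return "C"

df["ABC_CLASS"] = df["CUMULATIVE_SHARE"].apply(abc_class)
print(df.groupby("ABC_CLASS")["SKU"].count())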

Advice 2: Add Business Value

My articles always use generic dummy data to feed the algorithms and visuals generated.

You can enrich this data by adapting it to the industry you’re targeting.

  • Fashion retailers usually have seasonality and complex master data
  • In cosmetics, product categories are an essential demand driver that can affect the results of your forecast engine

Before jumping into the code, show that you can take ownership of the case study and adapt it to your vision of the problem to solve.

Advice 3: Code Refactoring and Packaging

My GitHub code is mostly in the form of Jupyter Notebook or standalone Python scripts.

This is a great opportunity for our junior data scientist to show that they can package the code into an API or even build a web application around it.

Indeed, data scientists are now expected to ship their models in a form that is ready for production.

Consider learning about script packaging, Docker containerization, and API development; a minimal example of an API wrapper follows below.
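As a purely illustrative sketch, here is what a minimal API wrapper around a segmentation script could look like with FastAPI. The endpoint, the segment_products function and the expected columns are hypothetical; adapt them to the notebook you are packaging.

from fastapi import FastAPI, UploadFile
import pandas as pd
import io

app = FastAPI(title="Product Segmentation API")

def segment_products(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for the notebook's segmentation logic
    df = df.sort_values("TURNOVER", ascending=False)
    df["CUMULATIVE_SHARE"] = df["TURNOVER"].cumsum() / df["TURNOVER"].sum()
    df["ABC_CLASS"] = pd.cut(df["CUMULATIVE_SHARE"], bins=[0, 0.8, 0.95, 1.0], labels=["A", "B", "C"])
    return df

@app.post("/segment")
async def segment(file: UploadFile):
    # Expects a CSV upload with at least "SKU" and "TURNOVER" columns
    content = await file.read()
    df = pd.read_csv(io.BytesIO(content))
    return segment_products(df).to_dict(orient="records")

From there, writing a Dockerfile around this script is a natural next step to practice containerization.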

Advice 4: Improve the UI and add insights

Remember, your skills will be judged by the impact of the analytics products you design and deploy.

Therefore, do not hesitate to improve the outputs and insights of the models shared in my cheat sheet.

It is an excellent opportunity to ask your colleagues in supply chain operations how these tools can support them.

  • What KPIs are they tracking?
  • What kind of insights do they lack to pilot their operations?

From here, this case study is yours to make your own.

If you follow these steps, your portfolio will not be a copy of my GitHub repository but a reflection of your skills and how you can impact businesses.

This is precisely what I did when I built the demo version of the LogiGreen Apps.

Screens of the ABC Analysis Module: Link – (Image by Samir Saci)

The demo version is publicly available for you to test the models and get inspiration: more information here.

I’m looking forward to seeing your version of it!

Conclusion

I hope this brief introduction to the cheat sheet has helped clarify how you can start building your analytics portfolio.

Updates Section of the Analytics Cheat Sheet – (Image by Samir Saci)

Do not hesitate to bookmark this cheat sheet, as I will update it each time new content is published.

I want to use this article and the YouTube video as a forum to collect your feedback or questions.

Do not hesitate to use the video’s comment section to ask questions!

If you have used any case studies for some of your projects, I would be happy to learn more about the results.

About Me

Let’s connect on Linkedin and Twitter; I am a Supply Chain Engineer using data analytics to improve Logistics operations and reduce costs.

For consulting or advice on analytics and sustainable Supply Chain transformation, feel free to contact me via Logigreen Consulting.

Samir Saci | Data Science & Productivity
A technical blog focusing on Data Science, Personal Productivity, Automation, Operations Research and Sustainable…samirsaci.com

The post Create Your Supply Chain Analytics Portfolio to Land Your Dream Job appeared first on Towards Data Science.

]]>
Master the 3D Reconstruction Process: A Step-by-Step Guide https://towardsdatascience.com/master-the-3d-reconstruction-process-step-by-step-guide/ Fri, 28 Mar 2025 20:25:57 +0000 https://towardsdatascience.com/?p=605331 Learn the complete 3D reconstruction pipeline from feature extraction to dense matching. Master photogrammetry with Python code examples and open-source tools.

The post Master the 3D Reconstruction Process: A Step-by-Step Guide appeared first on Towards Data Science.

]]>
The 3D Reconstruction journey from 2D photographs to 3D models follows a structured path.

This path consists of distinct steps that build upon each other to transform flat images into spatial information. 

Understanding this pipeline is crucial for anyone looking to create high-quality 3D reconstructions.

Let me explain…

Most people think 3D reconstruction means:

  • Taking random photos around an object
  • Pressing a button in expensive software
  • Waiting for magic to happen
  • Getting perfect results every time
  • Skipping the fundamentals

No thanks.

The most successful 3D Reconstructions I have seen are built on three core principles:

  • They use pipelines that work with fewer images but position them better.
  • They make sure users spend less time processing but achieve cleaner results.
  • They permit faster troubleshooting because users know exactly where to look.

Therefore, this hints at a nice lesson:

Your 3D models can only be as good as your understanding of how they’re created.

Looking at this from a scientific perspective is really key.

Let us dive right into it!

🦊 If you are new to my (3D) writing world, welcome! We are going on an exciting adventure that will allow you to master an essential 3D Python skill.

Once the scene is laid out, we embark on the Python journey. Everything is provided, including resources at the end. You will see Tips (🦚Notes and 🌱Growing) to help you get the most out of this article. Thanks to the 3D Geodata Academy for supporting the endeavor. This article is inspired by a small section of Module 1 of the 3D Reconstructor OS Course.

The Complete 3D Reconstruction Workflow

Let me highlight the 3D Reconstruction pipeline with Photogrammetry. The process follows a logical sequence of steps, as illustrated below.

What is important to note is that each step builds upon the previous one. Therefore, the quality of each stage directly impacts the final result, which is very important to keep in mind!

🦊 Understanding the entire process is crucial for troubleshooting workflows due to its sequential nature.

With that in mind, let’s detail each step, focusing on both the theory and practical implementation.

Natural Feature Extraction: Finding the Distinctive Points

Natural feature extraction is the foundation of the photogrammetry process. It identifies distinctive points in images that can be reliably located across multiple photographs.

These points serve as anchors that tie different views together.

🌱 When working with low-texture objects, consider adding temporary markers or texture patterns to improve feature extraction results.

Common feature extraction algorithms include:

Algorithm | Strengths | Weaknesses | Best For
SIFT | Scale and rotation invariant | Computationally expensive | High-quality, general-purpose reconstruction
SURF | Faster than SIFT | Less accurate than SIFT | Quick prototyping
ORB | Very fast, no patent restrictions | Less robust to viewpoint changes | Real-time applications

Let’s implement a simple feature extraction using OpenCV:

#%% SECTION 1: Natural Feature Extraction
import cv2
import numpy as np
import matplotlib.pyplot as plt

def extract_features(image_path, feature_method='sift', max_features=2000):
    """
    Extract features from an image using different methods.
    """

    # Read the image in color and convert to grayscale
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError(f"Could not read image at {image_path}")
    
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
    # Initialize feature detector based on method
    if feature_method.lower() == 'sift':
        detector = cv2.SIFT_create(nfeatures=max_features)
    elif feature_method.lower() == 'surf':
        # Note: SURF is patented and may not be available in all OpenCV distributions
        detector = cv2.xfeatures2d.SURF_create(400)  # Adjust threshold as needed
    elif feature_method.lower() == 'orb':
        detector = cv2.ORB_create(nfeatures=max_features)
    else:
        raise ValueError(f"Unsupported feature method: {feature_method}")
    
    # Detect and compute keypoints and descriptors
    keypoints, descriptors = detector.detectAndCompute(gray, None)
    
    # Create visualization
    img_with_features = cv2.drawKeypoints(
        img, keypoints, None, 
        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS
    )
    
    print(f"Extracted {len(keypoints)} {feature_method.upper()} features")
    
    return keypoints, descriptors, img_with_features

image_path = "sample_image.jpg"  # Replace with your image path

# Extract features with different methods
kp_sift, desc_sift, vis_sift = extract_features(image_path, 'sift')
kp_orb, desc_orb, vis_orb = extract_features(image_path, 'orb')

What I do here is run through an image, and hunt for distinctive patterns that stand out from their surroundings.

These patterns create mathematical “signatures” called descriptors that remain recognizable even when viewed from different angles or distances. 

Think of them as unique fingerprints that can be matched across multiple photographs.

The visualization step reveals exactly what the algorithm finds important in your image.

# Display results
plt.figure(figsize=(12, 6))
    
plt.subplot(1, 2, 1)
plt.title(f'SIFT Features ({len(kp_sift)})')
plt.imshow(cv2.cvtColor(vis_sift, cv2.COLOR_BGR2RGB))
plt.axis('off')
    
plt.subplot(1, 2, 2)
plt.title(f'ORB Features ({len(kp_orb)})')
plt.imshow(cv2.cvtColor(vis_orb, cv2.COLOR_BGR2RGB))
plt.axis('off')
    
plt.tight_layout()
plt.show()

Notice how corners, edges, and textured areas attract more keypoints, while smooth or uniform regions remain largely ignored.

This visual feedback is invaluable for understanding why some objects reconstruct better than others.

🦥 Geeky Note: The max_features parameter is critical. Setting it too high can dramatically slow processing and capture noise, while setting it too low might miss important details. For most objects, 2000-5000 features provide a good balance, but I’ll push it to 10,000+ for highly detailed architectural reconstructions.

Feature Matching: Connecting Images Together

Once features are extracted, the next step is to find correspondences between images. This process identifies which points in different images represent the same physical point in the real world. Feature matching creates the connections needed to determine camera positions.

I’ve seen countless attempts fail because the algorithm couldn’t reliably connect the same points across different images.

The ratio test is the silent hero that weeds out ambiguous matches before they poison your reconstruction.

#%% SECTION 2: Feature Matching
import cv2
import numpy as np
import matplotlib.pyplot as plt

def match_features(descriptors1, descriptors2, method='flann', ratio_thresh=0.75):
    """
    Match features between two images using different methods.
    """

    # Convert descriptors to appropriate type if needed
    if descriptors1 is None or descriptors2 is None:
        return []
    
    if method.lower() == 'flann':
        # FLANN parameters
        if descriptors1.dtype != np.float32:
            descriptors1 = np.float32(descriptors1)
        if descriptors2.dtype != np.float32:
            descriptors2 = np.float32(descriptors2)
            
        FLANN_INDEX_KDTREE = 1
        index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
        search_params = dict(checks=50)  # Higher values = more accurate but slower
        
        flann = cv2.FlannBasedMatcher(index_params, search_params)
        matches = flann.knnMatch(descriptors1, descriptors2, k=2)
    else:  # Brute Force
        # For ORB descriptors
        if descriptors1.dtype == np.uint8:
            bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False)
        else:  # For SIFT and SURF descriptors
            bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=False)
        
        matches = bf.knnMatch(descriptors1, descriptors2, k=2)
    
    # Apply Lowe's ratio test
    good_matches = []
    for match in matches:
        if len(match) == 2:  # Sometimes fewer than 2 matches are returned
            m, n = match
            if m.distance < ratio_thresh * n.distance:
                good_matches.append(m)
    
    return good_matches

def visualize_matches(img1, kp1, img2, kp2, matches, max_display=100):
    """
    Create a visualization of feature matches between two images.
    """

    # Limit the number of matches to display
    matches_to_draw = matches[:min(max_display, len(matches))]
    
    # Create match visualization
    match_img = cv2.drawMatches(
        img1, kp1, img2, kp2, matches_to_draw, None,
        flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
    )
    
    return match_img

# Load two images
img1_path = "image1.jpg"  # Replace with your image paths
img2_path = "image2.jpg"
    
# Extract features using SIFT (or your preferred method)
kp1, desc1, _ = extract_features(img1_path, 'sift')
kp2, desc2, _ = extract_features(img2_path, 'sift')
    
# Match features
good_matches = match_features(desc1, desc2, method='flann')
    
print(f"Found {len(good_matches)} good matches")

The matching process works by comparing feature descriptors between two images, measuring their mathematical similarity. For each feature in the first image, we find its two closest matches in the second image and assess their relative distances. 

If the closest match is significantly better than the second-best (as controlled by the ratio threshold), we consider it reliable.

# Visualize matches
img1 = cv2.imread(img1_path)
img2 = cv2.imread(img2_path)
match_visualization = visualize_matches(img1, kp1, img2, kp2, good_matches)
    
plt.figure(figsize=(12, 8))
plt.imshow(cv2.cvtColor(match_visualization, cv2.COLOR_BGR2RGB))
plt.title(f"Feature Matches: {len(good_matches)}")
plt.axis('off')
plt.tight_layout()
plt.show()

Visualizing these matches reveals the spatial relationships between your images.

Good matches form a consistent pattern that reflects the transform between viewpoints, while outliers appear as random connections. 

This pattern provides immediate feedback on image quality and camera positioning—clustered, consistent matches suggest good reconstruction potential.

🦥 Geeky Note: The ratio_thresh parameter (0.75) is Lowe’s original recommendation and works well in most situations. Lower values (0.6-0.7) produce fewer but more reliable matches, which is preferable for scenes with repetitive patterns. Higher values (0.8-0.9) yield more matches but increase the risk of outliers contaminating your reconstruction.

Beautiful, now, let us move at the main stage: the Structure from Motion node.

Structure From Motion: Placing Cameras in Space

Structure from Motion (SfM) reconstructs both the 3D scene structure and camera motion from the 2D image correspondences. This process determines where each photo was taken from and creates an initial sparse point cloud of the scene.

Key steps in SfM include:

  1. Estimating the fundamental or essential matrix between image pairs
  2. Recovering camera poses (position and orientation)
  3. Triangulating 3D points from 2D correspondences
  4. Building a track graph to connect observations across multiple images

The essential matrix encodes the geometric relationship between two camera viewpoints, revealing how they’re positioned relative to each other in space.

This mathematical relationship is the foundation for reconstructing both the camera positions and the 3D structure they observed.

#%% SECTION 3: Structure from Motion
import cv2
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def estimate_pose(kp1, kp2, matches, K, method=cv2.RANSAC, prob=0.999, threshold=1.0):
    """
    Estimate the relative pose between two cameras using matched features.
    """

    # Extract matched points
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    
    # Estimate essential matrix
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method, prob, threshold)
    
    # Recover pose from essential matrix
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    
    inlier_matches = [matches[i] for i in range(len(matches)) if mask[i] > 0]
    print(f"Estimated pose with {np.sum(mask)} inliers out of {len(matches)} matches")
    
    return R, t, mask, inlier_matches

def triangulate_points(kp1, kp2, matches, K, R1, t1, R2, t2):
    """
    Triangulate 3D points from two views.
    """

    # Extract matched points
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    
    # Create projection matrices
    P1 = np.dot(K, np.hstack((R1, t1)))
    P2 = np.dot(K, np.hstack((R2, t2)))
    
    # Triangulate points
    points_4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    
    # Convert to 3D points
    points_3d = points_4d[:3] / points_4d[3]
    
    return points_3d.T

def visualize_points_and_cameras(points_3d, R1, t1, R2, t2):
    """
    Visualize 3D points and camera positions.
    """

    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    
    # Plot points
    ax.scatter(points_3d[:, 0], points_3d[:, 1], points_3d[:, 2], c='b', s=1)
    
    # Helper function to create camera visualization
    def plot_camera(R, t, color):
        # Camera center
        center = -R.T @ t
        ax.scatter(center[0], center[1], center[2], c=color, s=100, marker='o')
        
        # Camera axes (showing orientation)
        axes_length = 0.5  # Scale to make it visible
        for i, c in zip(range(3), ['r', 'g', 'b']):
            axis = R.T[:, i] * axes_length
            ax.quiver(center[0], center[1], center[2], 
                      axis[0], axis[1], axis[2], 
                      color=c, arrow_length_ratio=0.1)
    
    # Plot cameras
    plot_camera(R1, t1, 'red')
    plot_camera(R2, t2, 'green')
    
    ax.set_title('3D Reconstruction: Points and Cameras')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    
    # Try to make axes equal
    max_range = np.max([
        np.max(points_3d[:, 0]) - np.min(points_3d[:, 0]),
        np.max(points_3d[:, 1]) - np.min(points_3d[:, 1]),
        np.max(points_3d[:, 2]) - np.min(points_3d[:, 2])
    ])
    
    mid_x = (np.max(points_3d[:, 0]) + np.min(points_3d[:, 0])) * 0.5
    mid_y = (np.max(points_3d[:, 1]) + np.min(points_3d[:, 1])) * 0.5
    mid_z = (np.max(points_3d[:, 2]) + np.min(points_3d[:, 2])) * 0.5
    
    ax.set_xlim(mid_x - max_range * 0.5, mid_x + max_range * 0.5)
    ax.set_ylim(mid_y - max_range * 0.5, mid_y + max_range * 0.5)
    ax.set_zlim(mid_z - max_range * 0.5, mid_z + max_range * 0.5)
    
    plt.tight_layout()
    plt.show()

🦥 Geeky Note: The RANSAC threshold parameter (threshold=1.0) determines how strict we are about geometric consistency. I’ve found that 0.5-1.0 works well for controlled environments, but increasing to 1.5-2.0 helps with outdoor scenes where wind might cause slight camera movements. The probability parameter (prob=0.999) ensures high confidence but increases computation time; 0.95 is sufficient for prototyping.

The essential matrix estimation uses matched feature points and the camera’s internal parameters to calculate the geometric relationship between images.

This relationship is then decomposed to extract rotation and translation information – essentially determining where each photo was taken from in 3D space. The accuracy of this step directly affects everything that follows.


# This is a simplified example - in practice you would use images and matches
# from the previous steps
    
# Example camera intrinsic matrix (replace with your calibrated values)
K = np.array([
        [1000, 0, 320],
        [0, 1000, 240],
        [0, 0, 1]
])
    
# For first camera, we use identity rotation and zero translation
R1 = np.eye(3)
t1 = np.zeros((3, 1))
    
# Load images, extract features, and match as in previous sections
img1_path = "image1.jpg"  # Replace with your image paths
img2_path = "image2.jpg"
    
img1 = cv2.imread(img1_path)
img2 = cv2.imread(img2_path)
    
kp1, desc1, _ = extract_features(img1_path, 'sift')
kp2, desc2, _ = extract_features(img2_path, 'sift')
    
matches = match_features(desc1, desc2, method='flann')
    
# Estimate pose of second camera relative to first
R2, t2, mask, inliers = estimate_pose(kp1, kp2, matches, K)
    
# Triangulate points
points_3d = triangulate_points(kp1, kp2, inliers, K, R1, t1, R2, t2)

Once camera positions are established, triangulation projects rays from matched points in multiple images to determine where they intersect in 3D space.

# Visualize the result
visualize_points_and_cameras(points_3d, R1, t1, R2, t2)

These intersections form the initial sparse point cloud, providing the skeleton upon which dense reconstruction will later build. The visualization shows both the reconstructed points and the camera positions, helping you understand the spatial relationships in your dataset.

🌱 SfM works best with a good network of overlapping images. Aim for at least 60% overlap between adjacent images for reliable reconstruction.

Bundle Adjustment: Optimizing for Accuracy

There is an extra optimization stage that takes place within the Structure from Motion “compute node”.

This is called: Bundle Adjustment.

It is a refinement step that jointly optimizes camera parameters and 3D point positions. What that means is that it minimizes the reprojection error, i.e. the difference between observed image points and the projection of their corresponding 3D points.

Does this make sense to you? Essentially, this optimization is great because it:

  • Improves the accuracy of the reconstruction
  • Corrects for accumulated drift
  • Ensures global consistency of the model

At this stage, this should be enough to get a good intuition of how it works.

🌱 In larger projects, incremental bundle adjustment (optimizing after adding each new camera) can improve both speed and stability compared to global adjustment at the end.
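To make the intuition a little more concrete, here is a minimal two-view bundle adjustment sketch using SciPy. This is not how COLMAP or other SfM tools implement it (they rely on sparse solvers that scale to many cameras); it simply assumes the variables from the Structure from Motion section (K, R2, t2, kp1, kp2, inliers, points_3d) are still in scope.

#%% OPTIONAL: Two-view bundle adjustment sketch
import numpy as np
import cv2
from scipy.optimize import least_squares

# Observed 2D points for the inlier matches (same extraction as in triangulation)
pts1 = np.float32([kp1[m.queryIdx].pt for m in inliers])
pts2 = np.float32([kp2[m.trainIdx].pt for m in inliers])

def reprojection_residuals(params, n_points, K, pts1, pts2):
    # Camera 1 is fixed at the origin; params = [rvec2 (3), tvec2 (3), 3D points (3N)]
    rvec2 = params[0:3]
    tvec2 = params[3:6]
    pts_3d = params[6:].reshape(n_points, 3)

    # Reproject the current 3D points into both cameras
    proj1, _ = cv2.projectPoints(pts_3d, np.zeros(3), np.zeros(3), K.astype(np.float64), None)
    proj2, _ = cv2.projectPoints(pts_3d, rvec2, tvec2, K.astype(np.float64), None)

    res1 = (proj1.reshape(-1, 2) - pts1).ravel()
    res2 = (proj2.reshape(-1, 2) - pts2).ravel()
    return np.concatenate([res1, res2])

# Pack the second camera pose (as a Rodrigues vector) and the 3D points into one vector
rvec2_init, _ = cv2.Rodrigues(R2)
x0 = np.hstack([rvec2_init.ravel(), t2.ravel(), points_3d.ravel()])

result = least_squares(
    reprojection_residuals, x0,
    args=(len(points_3d), K, pts1, pts2),
    method='trf', loss='huber'  # robust loss limits the influence of bad matches
)

R2_refined, _ = cv2.Rodrigues(result.x[0:3])
t2_refined = result.x[3:6].reshape(3, 1)
points_3d_refined = result.x[6:].reshape(-1, 3)
print(f"Reprojection RMSE after refinement: {np.sqrt(np.mean(result.fun**2)):.3f} px")

Real pipelines exploit the sparsity of the Jacobian (each point is only seen by a few cameras), which is what makes bundle adjustment tractable on thousands of images.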

Dense Matching: Creating Detailed Reconstructions

After establishing camera positions and sparse points, the final step is dense matching to create a detailed representation of the scene. 

Dense matching uses the known camera parameters to match many more points between images, resulting in a complete point cloud.

Common approaches include (a minimal sketch of the last one follows this list):

  • Multi-View Stereo (MVS)
  • Patch-based Multi-View Stereo (PMVS)
  • Semi-Global Matching (SGM)
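To make the last approach concrete, here is a minimal sketch using OpenCV's semi-global block matching on a rectified stereo pair. The "left.jpg" and "right.jpg" file names are placeholders, and a full MVS pipeline (COLMAP, PMVS) handles unrectified multi-view input for you, so treat this only as an illustration of the per-pixel matching step.

#%% OPTIONAL: Dense matching sketch with Semi-Global Block Matching
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical rectified stereo pair
left = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,   # disparity search range, must be divisible by 16
    blockSize=5,
    P1=8 * 5 ** 2,        # smoothness penalty for small disparity changes
    P2=32 * 5 ** 2,       # smoothness penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2
)

# OpenCV returns fixed-point disparities scaled by 16
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

plt.imshow(disparity, cmap='plasma')
plt.title("SGM disparity map")
plt.colorbar()
plt.show()

Each valid disparity can then be converted to depth using the calibrated focal length and baseline, which is how dense point clouds are built from many such image pairs.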

Putting It All Together: Practical Tools

The theoretical pipeline is implemented in several open-source and commercial software packages. Each offers different features and capabilities:

Tool | Strengths | Use Case | Pricing
COLMAP | Highly accurate, customizable | Research, precise reconstructions | Free, open-source
OpenMVG | Modular, extensive documentation | Education, integration with custom pipelines | Free, open-source
Meshroom | User-friendly, node-based interface | Artists, beginners | Free, open-source
RealityCapture | Extremely fast, high-quality results | Professional, large-scale projects | Commercial

These tools package the various pipeline steps described above into a more user-friendly interface, but understanding the underlying processes is still essential for troubleshooting and optimization.

Automating the reconstruction pipeline saves countless hours of manual work.

The real productivity boost comes from scripting the entire process end-to-end, from raw photos to dense point cloud.

COLMAP’s command-line interface makes this automation possible, even for complex reconstruction tasks.

#%% SECTION 4: Complete Pipeline Automation with COLMAP
import os
import subprocess
import glob
import numpy as np

def run_colmap_pipeline(image_folder, output_folder, colmap_path="colmap"):
    """
    Run the complete COLMAP pipeline from feature extraction to dense reconstruction.
    """

    # Create output directories if they don't exist
    sparse_folder = os.path.join(output_folder, "sparse")
    dense_folder = os.path.join(output_folder, "dense")
    database_path = os.path.join(output_folder, "database.db")
    
    os.makedirs(output_folder, exist_ok=True)
    os.makedirs(sparse_folder, exist_ok=True)
    os.makedirs(dense_folder, exist_ok=True)
    
    # Step 1: Feature extraction
    print("Step 1: Feature extraction")
    feature_cmd = [
        colmap_path, "feature_extractor",
        "--database_path", database_path,
        "--image_path", image_folder,
        "--ImageReader.camera_model", "SIMPLE_RADIAL",
        "--ImageReader.single_camera", "1",
        "--SiftExtraction.use_gpu", "1"
    ]
    
    try:
        subprocess.run(feature_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Feature extraction failed: {e}")
        return False
    
    # Step 2: Match features
    print("Step 2: Feature matching")
    match_cmd = [
        colmap_path, "exhaustive_matcher",
        "--database_path", database_path,
        "--SiftMatching.use_gpu", "1"
    ]
    
    try:
        subprocess.run(match_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Feature matching failed: {e}")
        return False
    
    # Step 3: Sparse reconstruction (Structure from Motion)
    print("Step 3: Sparse reconstruction")
    sfm_cmd = [
        colmap_path, "mapper",
        "--database_path", database_path,
        "--image_path", image_folder,
        "--output_path", sparse_folder
    ]
    
    try:
        subprocess.run(sfm_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Sparse reconstruction failed: {e}")
        return False
    
    # Find the largest sparse model
    sparse_models = glob.glob(os.path.join(sparse_folder, "*/"))
    if not sparse_models:
        print("No sparse models found")
        return False
    
    # Pick the largest model (using the number of registered images as a proxy)
    largest_model = 0
    max_images = 0
    for i, model_dir in enumerate(sparse_models):
        images_txt = os.path.join(model_dir, "images.txt")
        if os.path.exists(images_txt):
            with open(images_txt, 'r') as f:
                num_images = sum(1 for line in f if line.strip() and not line.startswith("#"))
                num_images = num_images // 2  # Each image has 2 lines
                if num_images > max_images:
                    max_images = num_images
                    largest_model = i
    
    selected_model = os.path.join(sparse_folder, str(largest_model))
    print(f"Selected model {largest_model} with {max_images} images")
    
    # Step 4: Image undistortion
    print("Step 4: Image undistortion")
    undistort_cmd = [
        colmap_path, "image_undistorter",
        "--image_path", image_folder,
        "--input_path", selected_model,
        "--output_path", dense_folder,
        "--output_type", "COLMAP"
    ]
    
    try:
        subprocess.run(undistort_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Image undistortion failed: {e}")
        return False
    
    # Step 5: Dense reconstruction (Multi-View Stereo)
    print("Step 5: Dense reconstruction")
    mvs_cmd = [
        colmap_path, "patch_match_stereo",
        "--workspace_path", dense_folder,
        "--workspace_format", "COLMAP",
        "--PatchMatchStereo.geom_consistency", "true"
    ]
    
    try:
        subprocess.run(mvs_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Dense reconstruction failed: {e}")
        return False
    
    # Step 6: Stereo fusion
    print("Step 6: Stereo fusion")
    fusion_cmd = [
        colmap_path, "stereo_fusion",
        "--workspace_path", dense_folder,
        "--workspace_format", "COLMAP",
        "--input_type", "geometric",
        "--output_path", os.path.join(dense_folder, "fused.ply")
    ]
    
    try:
        subprocess.run(fusion_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Stereo fusion failed: {e}")
        return False
    
    print("Pipeline completed successfully!")
    return True

The script orchestrates a series of COLMAP operations that would normally require manual intervention at each stage. It handles the progression from feature extraction through matching, sparse reconstruction, and finally dense reconstruction – maintaining the correct data flow between steps. This automation becomes invaluable when processing multiple datasets or when iteratively refining reconstruction parameters.

# Replace with your image and output folder paths
image_folder = "path/to/images"
output_folder = "path/to/output"
    
# Path to COLMAP executable (may be just "colmap" if it's in your PATH)
colmap_path = "colmap"
    
run_colmap_pipeline(image_folder, output_folder, colmap_path)

One key aspect is the automatic selection of the largest reconstructed model. In challenging datasets, COLMAP sometimes creates multiple disconnected reconstructions rather than a single cohesive model. 

The script intelligently identifies and continues with the most complete reconstruction, using image count as a proxy for model quality and completeness.

🦥 Geeky Note: The --SiftExtraction.use_gpu and --SiftMatching.use_gpu flags enable GPU acceleration, speeding up processing by 5-10x. For dense reconstruction, the --PatchMatchStereo.geom_consistency true parameter significantly improves quality by enforcing consistency across multiple views, at the cost of longer processing time.

The Power of Understanding the Pipeline

Understanding the full reconstruction pipeline gives you control over your 3D modeling process. When you encounter issues, knowing which stage might be causing problems allows you to target your troubleshooting efforts effectively.

As illustrated, common issues and their sources include:

  1. Missing or incorrect camera poses: Feature extraction and matching problems
  2. Incomplete reconstruction: Insufficient image overlap
  3. Noisy point clouds: Poor bundle adjustment or camera calibration
  4. Failed reconstruction: Problematic images (motion blur, poor lighting)

The ability to diagnose these issues comes from a deep understanding of how each pipeline component works and interacts with others.

Next Steps: Practice and Automation

Now that you understand the pipeline, it’s time to put it into practice. Experiment with the provided code examples and try automating the process for your own datasets.

Start with small, well-controlled scenes and gradually tackle more complex environments as you gain confidence.

Remember that the quality of your input images dramatically affects the final result. Take time to capture high-quality photographs with good overlap, consistent lighting, and minimal motion blur.

🌱 Consider starting a small personal project to reconstruct an object you own. Document your process, including the issues you encounter and how you solve them – this practical experience is invaluable.

References and useful resources

I compiled for you some interesting software, tools, and useful algorithm extended documentation:

Software and Tools

  • COLMAP – Free, open-source 3D reconstruction software
  • OpenMVG – Open Multiple View Geometry library
  • Meshroom – Free node-based photogrammetry software
  • RealityCapture – Commercial high-performance photogrammetry software
  • Agisoft Metashape – Commercial photogrammetry and 3D modeling software
  • OpenCV – Computer vision library with feature detection implementations
  • 3DF Zephyr – Photogrammetry software for 3D reconstruction
  • Python – Programming language ideal for 3D reconstruction automation

Algorithms

About the author

Florent Poux, Ph.D. is a Scientific and Course Director focused on educating engineers on leveraging AI and 3D Data Science. He leads research teams and teaches 3D Computer Vision at various universities. His current aim is to ensure humans are correctly equipped with the knowledge and skills to tackle 3D challenges for impactful innovations.

Resources

  1. 🏆Awards: Jack Dangermond Award
  2. 📕Book: 3D Data Science with Python
  3. 📜Research: 3D Smart Point Cloud (Thesis)
  4. 🎓Courses: 3D Geodata Academy Catalog
  5. 💻Code: Florent’s Github Repository
  6. 💌3D Tech Digest: Weekly Newsletter

The post Master the 3D Reconstruction Process: A Step-by-Step Guide appeared first on Towards Data Science.

]]>