Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech

A deep dive into residual vector quantizers, conversational speech AI, and talkative transformers.

Recently, Sesame AI published a demo of their latest Speech-to-Speech model. It is a conversational AI agent that is really good at speaking: it provides relevant answers, it speaks with expression, and honestly, it is just very fun and interactive to play with.

Note that a technical paper is not out yet, but they do have a short blog post that provides a lot of information about the techniques they used and previous algorithms they built upon. 

Thankfully, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or CSM. It takes both text and audio as input and generates speech as audio. While they haven’t revealed their training data sources in the articles, we can still take a solid guess. The blog post heavily cites another CSM, 2024’s Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream training), and 2,000 more hours of telephone conversations (the Fisher dataset).


Sesame builds upon the Moshi Paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values — a waveform. For example, if you’re sampling audio at 24 kHz, you are capturing 24,000 float values every second.

There are 24000 values here to represent 1 second of speech! (Image generated by author)

Of course, it is quite resource-intensive to process 24000 float values for just one second of data, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.

We will take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of Audio/Speech modeling in Deep Learning today. We will end the article by learning about how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper. It is a self-supervised audio encoder-decoder model that first converts audio waveforms into discrete “latent” tokens and then reconstructs the original signal. Sesame only uses the encoder section of Mimi to tokenize the input audio. Let’s learn how.

Mimi takes the raw speech waveform at 24 kHz and passes it through several strided convolution layers to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This means that the first CNN block downsamples the audio by 4x, then 5x, then 6x, and so on. In total, it downsamples by a factor of 1920, reducing the signal to just 12.5 frames per second.

The convolution blocks also project the original float values into an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. One second of audio is now represented by roughly 12 vectors of size 512. This way, Mimi reduces the sequence length from 24,000 samples to just 12.5 frames per second and converts them into dense continuous vectors.

Before applying any quantization, the Mimi encoder downsamples the input 24 kHz audio by a factor of 1920 and embeds it into 512 dimensions. In other words, you get 12.5 frames per second, with each frame a 512-dimensional vector. (Image from author’s video)
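To make the numbers tangible, here is a minimal sketch of the downsampling idea in PyTorch. The layer shapes, kernel sizes, and activations are illustrative assumptions, not the actual Mimi architecture; the point is only that five strided convolutions with strides 4, 5, 6, 8, and 2 shrink the time axis by 4 x 5 x 6 x 8 x 2 = 1920.

import torch
import torch.nn as nn

# Illustrative strided-convolution stack (not Mimi's real encoder).
strides = [4, 5, 6, 8, 2]
layers, in_ch = [], 1
for s in strides:
    layers += [nn.Conv1d(in_ch, 512, kernel_size=2 * s, stride=s, padding=s // 2), nn.ELU()]
    in_ch = 512
encoder = nn.Sequential(*layers)

one_second = torch.randn(1, 1, 24000)   # 1 second of 24 kHz audio: (batch, channels, samples)
frames = encoder(one_second)
print(frames.shape)                      # torch.Size([1, 512, 12]): about 12 latent frames per second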

What is Audio Quantization?

Given the continuous embeddings obtained after the convolution layer, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language learning transformers to train generative models.

Mimi uses a Residual Vector Quantizer, or RVQ, tokenizer to achieve this. We will talk about the residual part soon, but first, let’s look at what a simple, vanilla vector quantizer does.

Vector Quantization

The idea behind Vector Quantization is simple: you train a codebook, which is a collection of, say, 1,000 code vectors, all of size 512 (the same as your embedding dimension).

A Vanilla Vector Quantizer. A codebook of embeddings is trained. Given an input embedding, we map/quantize it to the nearest codebook entry. (Screenshot from author’s video)

Then, given the input vector, we map it to the closest vector in our codebook — basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame embedding may be, we represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic, where I go much deeper into it.

More about Vector Quantization! (Video by author)
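To make the “snapping” step concrete, here is a minimal NumPy sketch. The codebook here is random, and its size (1,000 codes of dimension 512) is just an illustrative assumption; in practice the codebook is learned during training.

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1000, 512))   # 1,000 code vectors of size 512, standing in for a trained codebook

def quantize(frame_embedding):
    """Snap a 512-dim frame embedding to its nearest codebook entry.
    Returns the discrete token id and the code vector it maps to."""
    distances = np.linalg.norm(codebook - frame_embedding, axis=1)
    token_id = int(np.argmin(distances))
    return token_id, codebook[token_id]

frame = rng.normal(size=512)              # one continuous audio-frame embedding
token_id, code = quantize(frame)
print(token_id)                            # a single discrete token representing this frame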

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high because we are mapping each vector to its cluster’s centroid. This “snap” is rarely perfect, so there is always an error between the original embedding and the nearest codebook entry.

The big idea of Residual Vector Quantization is that it doesn’t stop at having just one codebook. Instead, it tries to use multiple codebooks to represent the input vector.

  1. First, you quantize the original vector using the first codebook.
  2. Then, you subtract that centroid from your original vector. What you’re left with is the residual — the error that wasn’t captured in the first quantization.
  3. Now take this residual, and quantize it again, using a second codebook full of brand new code vectors — again by snapping it to the nearest centroid.
  4. Subtract that too, and you get a smaller residual. Quantize again with a third codebook… and you can keep doing this for as many codebooks as you want.
Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer to represent the previous codebook’s error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, let’s say, N codebooks, you get a collection of N discrete tokens from each stage of quantization to represent one audio frame.
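Here is the same idea as a minimal NumPy sketch, extended to the residual case. The number of codebooks and their sizes are illustrative assumptions, not Mimi’s actual configuration.

import numpy as np

rng = np.random.default_rng(1)
N_CODEBOOKS, N_CODES, DIM = 8, 1024, 512
codebooks = rng.normal(size=(N_CODEBOOKS, N_CODES, DIM))   # one stand-in codebook per stage

def rvq_encode(frame):
    """Each stage quantizes the residual left over by the previous stages,
    yielding one discrete token per codebook for a single audio frame."""
    residual, tokens = frame.copy(), []
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]    # keep only the part not yet explained
    return tokens

def rvq_decode(tokens):
    """Reconstruction is simply the sum of the selected codes from every stage."""
    return sum(codebooks[stage][idx] for stage, idx in enumerate(tokens))

frame = rng.normal(size=DIM)
tokens = rvq_encode(frame)
print(tokens)                                                # 8 discrete ids for one frame
print(np.linalg.norm(frame - rvq_decode(tokens)))            # remaining quantization error

With random codebooks the reconstruction error stays large; during training the codebooks are learned so that each successive stage shrinks it.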

The coolest thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. In the subsequent quantizers, they learn more and more fine-grained features.

If you’re familiar with PCA, you can think of the first codebook as containing the primary principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds more details.

Residual Vector Quantizers (RVQ) uses multiple codebooks to encode the input vector — one entry from each codebook. (Screenshot from author’s video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal to the discretized latent space, and the decoder reconstructs it back from the latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space. 

Mimi also separately trains a single codebook (vanilla VQ) that only focuses on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer – it divides the quantization process into two independent parallel paths: one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi used knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic RVQ code and the WavLM-generated embedding.
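As a rough sketch of what such a distillation term could look like (the exact formulation in the Moshi paper may differ; this only illustrates “minimize cosine distance to the teacher embedding”):

import torch
import torch.nn.functional as F

def semantic_distillation_loss(semantic_code, teacher_embedding):
    """Pull the semantic codebook output toward the teacher (e.g., WavLM) embedding
    by minimizing the cosine distance, i.e. 1 - cosine similarity."""
    return (1 - F.cosine_similarity(semantic_code, teacher_embedding, dim=-1)).mean()

# Toy tensors standing in for a batch of frame-level embeddings.
student = torch.randn(4, 512, requires_grad=True)
teacher = torch.randn(4, 512)
print(semantic_distillation_loss(student, teacher))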


Audio Decoder

Given a conversation containing text and audio, we first convert it into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then fed into a transformer model as a time series. In the blog post, this model is referred to as the Autoregressive Backbone Transformer. Its task is to process this time series and output the “zeroth” codebook token.

A lighter-weight transformer, called the audio decoder, then reconstructs the next codebook tokens conditioned on this zeroth code generated by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder only operates on the zeroth token and generates the other N-1 codes. These codes are generated by using N-1 distinct linear layers that output the probability of choosing each code from their corresponding codebooks.

You can imagine this process as predicting a text token from the vocabulary in a text-only LLM. The difference is that a text-based LLM has a single vocabulary, while the RVQ tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each.
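A minimal sketch of that idea in PyTorch; the sizes (8 codebooks of 1,024 codes, a hidden size of 1,024) are hypothetical, and greedy argmax stands in for proper sampling:

import torch
import torch.nn as nn

N_CODEBOOKS, CODEBOOK_SIZE, HIDDEN = 8, 1024, 1024

# One classification head per remaining codebook: each head outputs logits over
# that codebook's vocabulary, just like an LM head over a text vocabulary.
heads = nn.ModuleList(nn.Linear(HIDDEN, CODEBOOK_SIZE) for _ in range(N_CODEBOOKS - 1))

decoder_state = torch.randn(1, HIDDEN)                        # audio decoder output for one frame
codes = [int(torch.argmax(head(decoder_state), dim=-1)) for head in heads]
print(codes)                                                   # the N-1 remaining codebook tokens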

The Sesame Architecture (Illustration by the author)

Finally, after the codewords are all generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this audio back into a waveform. For this, we apply transposed convolutional layers to upsample the embedding from 12.5 frames per second back to a 24 kHz waveform. Basically, we reverse the transforms we applied during audio preprocessing.

In Summary

Check out the accompanying video on this article! (Video by author)

So, here is the overall summary of the Sesame model in some bullet points.

  1. Sesame is built on a multimodal Conversational Speech Model (CSM).
  2. Text and audio are tokenized together to form a sequence of tokens and fed into the backbone transformer, which autoregressively processes the sequence.
  3. While the text is processed like in any other text-based LLM, the audio is processed directly from its waveform representation. Sesame uses the Mimi encoder to convert the waveform into latent codes using a split-RVQ tokenizer.
  4. The multimodal backbone transformer consumes the sequence of tokens and predicts the next zeroth codeword.
  5. Another lightweight transformer, called the audio decoder, predicts the remaining codewords from the zeroth codeword.
  6. The final audio frame representation is generated by combining all the generated codewords, then upsampling back to the waveform representation.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Relevant papers: 
Moshi: https://arxiv.org/abs/2410.00037 
SoundStream: https://arxiv.org/abs/2107.03312 
HuBERT: https://arxiv.org/abs/2106.07447 
SpeechTokenizer: https://arxiv.org/abs/2308.16692


Learnings from a Machine Learning Engineer — Part 6: The Human Side

Practical advice for the humans involved with machine learning

In my previous articles, I have spent a lot of time talking about the technical aspects of an Image Classification problem, from data collection, model evaluation, and performance optimization to a detailed look at model training.

These elements require a certain degree of in-depth expertise, and they (usually) have well-defined metrics and established processes that are within our control.

Now it’s time to consider…

The human aspects of machine learning

Yes, this may seem like an oxymoron! But it is the interaction with people — the ones you work with and the ones who use your application — that helps bring the technology to life and provides a sense of fulfillment to your work.

These human interactions include:

  • Communicating technical concepts to a non-technical audience.
  • Understanding how your end-users engage with your application.
  • Providing clear expectations on what the model can and cannot do.

I also want to touch on the impact on people’s jobs, both positive and negative, as AI becomes a part of our everyday lives.

Overview

As in my previous articles, I will gear this discussion around an image classification application. With that in mind, these are the groups of people involved with your project:

  • AI/ML Engineer (that’s you) — bringing life to the Machine Learning application.
  • MLOps team — your peers who will deploy, monitor, and enhance your application.
  • Subject matter experts — the ones who will provide the care and feeding of labeled data.
  • Stakeholders — the ones who are looking for a solution to a real world problem.
  • End-users — the ones who will be using your application. These could be internal and external customers.
  • Marketing — the ones who will be promoting usage of your application.
  • Leadership — the ones who are paying the bill and need to see business value.

Let’s dive right in…

AI/ML Engineer

You may be a part of a team or a lone wolf. You may be an individual contributor or a team leader.

Photo by Christina @ wocintechchat.com on Unsplash

Whatever your role, it is important to see the whole picture — not only the coding, the data science, and the technology behind AI/ML — but the value that it brings to your organization.

Understand the business needs

Your company faces many challenges to reduce expenses, improve customer satisfaction, and remain profitable. Position yourself as someone who can create an application that helps achieve their goals.

  • What are the pain points in a business process?
  • What is the value of using your application (time savings, cost savings)?
  • What are the risks of a poor implementation?
  • What is the roadmap for future enhancements and use-cases?
  • What other areas of the business could benefit from the application, and what design choices will help future-proof your work?

Communication

Deep technical discussions with your peers are probably your comfort zone. However, to be a more successful AI/ML Engineer, you should be able to clearly explain the work you are doing to different audiences.

With practice, you can explain these topics in ways that your non-technical business users can follow along with, and understand how your technology will benefit them.

To help you get comfortable with this, try creating a PowerPoint with 2–3 slides that you can cover in 5–10 minutes. For example, explain how a neural network can take an image of a cat or a dog and determine which one it is.

Practice giving this presentation in your mind, to a friend — even your pet dog or cat! This will get you more comfortable with the transitions, tighten up the content, and ensure you cover all the important points as clearly as possible.

  • Be sure to include visuals — pure text is boring, graphics are memorable.
  • Keep an eye on time — respect your audience’s busy schedule and stick to the 5–10 minutes you are given.
  • Put yourself in their shoes — your audience is interested in how the technology will benefit them, not in how smart you are.

Creating a technical presentation is a lot like the Feynman Technique — explaining a complex subject to your audience by breaking it into easily digestible pieces, with the added benefit of helping you understand it more completely yourself.

MLOps team

These are the people that deploy your application, manage data pipelines, and monitor infrastructure that keeps things running.

Without them, your model lives in a Jupyter notebook and helps nobody!

Photo by airfocus on Unsplash

These are your technical peers, so you should be able to connect with their skillset more naturally. You speak in jargon that sounds like a foreign language to most people. Even so, it is extremely helpful for you to create documentation to set expectations around:

  • Process and data flows.
  • Data quality standards.
  • Service level agreements for model performance and availability.
  • Infrastructure requirements for compute and storage.
  • Roles and responsibilities.

It is easy to have a more informal relationship with your MLOps team, but remember that everyone is trying to juggle many projects at the same time.

Email and chat messages are fine for quick-hit issues. But for larger tasks, you will want a system to track things like user stories, enhancement requests, and break-fix issues. This way you can prioritize the work and ensure you don’t forget something. Plus, you can show progress to your supervisor.

Some great tools exist, such as:

  • Jira, GitHub, Azure DevOps Boards, Asana, Monday, etc.

We are all professionals, so having a more formal system to avoid miscommunication and mistrust is good business.

Subject matter experts

These are the team members that have the most experience working with the data that you will be using in your AI/ML project.

Photo by National Cancer Institute on Unsplash

SMEs are very skilled at dealing with messy data — they are human, after all! They can handle one-off situations by considering knowledge outside of their area of expertise. For example, a doctor may recognize metal inserts in a patient’s X-ray that indicate prior surgery. They may also notice a faulty X-ray image due to equipment malfunction or technician error.

However, your machine learning model only knows what it knows, which comes from the data it was trained on. So, those one-off cases may not be appropriate for the model you are training. Your SMEs need to understand that clear, high quality training material is what you are looking for.

Think like a computer

In the case of an image classification application, the output from the model communicates to you how well it was trained on the data set. This comes in the form of error rates, which is very much like when a student takes an exam and you can tell how well they studied by seeing how many questions — and which ones — they get wrong.

In order to reduce error rates, your image data set needs to be objectively “good” training material. To do this, put yourself in an analytical mindset and ask yourself:

  • What images will the computer get the most useful information out of? Make sure all the relevant features are visible.
  • What is it about an image that confused the model? When it makes an error, try to understand why — objectively — by looking at the entire picture.
  • Is this image a “one-off” or a typical example of what the end-users will send? Consider creating a new subclass of exceptions to the norm.

Be sure to communicate to your SMEs that model performance is directly tied to data quality and give them clear guidance:

  • Provide visual examples of what works.
  • Provide counter-examples of what does not work.
  • Ask for a wide variety of data points. In the X-ray example, be sure to get patients with different ages, genders, and races.
  • Provide options to create subclasses of your data for further refinement. Use that X-ray from a patient with prior surgery as a subclass, and eventually as you can get more examples over time, the model can handle them.

This also means that you should become familiar with the data they are working with — perhaps not expert level, but certainly above a novice level.

Lastly, when working with SMEs, be cognizant of the impression they may have that the work you are doing is somehow going to replace their job. It can feel threatening when someone asks you how to do your job, so be mindful.

Ideally, you are building a tool with honest intentions and it will enable your SMEs to augment their day-to-day work. If they can use the tool as a second opinion to validate their conclusions in less time, or perhaps even avoid mistakes, then this is a win for everyone. Ultimately, the goal is to allow them to focus on more challenging situations and achieve better outcomes.

I have more to say on this in my closing remarks.

Stakeholders

These are the people you will have the closest relationship with.

Stakeholders are the ones who created the business case to have you build the machine learning model in the first place.

Photo by Ninthgrid on Unsplash

They have a vested interest in having a model that performs well. Here are some key points when working with your stakeholders:

  • Be sure to listen to their needs and requirements.
  • Anticipate their questions and be prepared to respond.
  • Be on the lookout for opportunities to improve your model performance. Your stakeholders may not be as close to the technical details as you are and may not think there is any room for improvement.
  • Bring issues and problems to their attention. They may not want to hear bad news, but they will appreciate honesty over evasion.
  • Schedule regular updates with usage and performance reports.
  • Explain technical details in terms that are easy to understand.
  • Set expectations on regular training and deployment cycles and timelines.

Your role as an AI/ML Engineer is to bring to life the vision of your stakeholders. Your application is making their lives easier, which justifies and validates the work you are doing. It’s a two-way street, so be sure to share the road.

End-users

These are the people who are using your application. They may also be your harshest critics, but you may never even hear their feedback.

Photo by Alina Ruf on Unsplash

Think like a human

Recall above when I suggested to “think like a computer” when analyzing the data for your training set. Now it’s time to put yourself in the shoes of a non-technical user of your application.

End-users of an image classification model communicate their understanding of what’s expected of them by way of poor images. These are like the students that didn’t study for the exam, or worse, didn’t read the questions, so their answers don’t make sense.

Your model may be really good, but if end-users misuse the application or are not satisfied with the output, you should be asking:

  • Are the instructions confusing or misleading? Did the user focus the camera on the subject being classified, or is it more of a wide-angle image? You can’t blame the user if they follow bad instructions.
  • What are their expectations? When the results are presented to the user, are they satisfied or are they frustrated? You may notice repeated images from frustrated users.
  • Are the usage patterns changing? Are they trying to use the application in unexpected ways? This may be an opportunity to improve the model.

Inform your stakeholders of your observations. There may be simple fixes to improve end-user satisfaction, or there may be more complex work ahead.

If you are lucky, you may discover an unexpected way to leverage the application that leads to expanded usage or exciting benefits to your business.

Explainability

Most AI/ML models are considered “black boxes” that perform millions of calculations on extremely high-dimensional data and produce a rather simplistic result without any reason behind it.

The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.
— The Hitchhiker’s Guide to the Galaxy

Depending on the situation, your end-users may require more explanation of the results, such as with medical imaging. Where possible, you should consider incorporating model explainability techniques such as LIME, SHAP, and others. These responses can help put a human touch to cold calculations.

Now it’s time to switch gears and consider higher-ups in your organization.

Marketing team

These are the people who promote the use of your hard work. If your end-users are completely unaware of your application, or don’t know where to find it, your efforts will go to waste.

The marketing team controls where users can find your app on your website and link to it through social media channels. They also see the technology through a different lens.

Gartner hype cycle. Image from Wikipedia – https://en.wikipedia.org/wiki/Gartner_hype_cycle

The above hype cycle is a good representation of how technical advancements tend to flow. At the beginning, there can be an unrealistic expectation of what your new AI/ML tool can do — it’s the greatest thing since sliced bread!

Then the “new” wears off and excitement wanes. You may face a lack of interest in your application, and the marketing team (as well as your end-users) move on to the next thing. In reality, the value of your efforts is somewhere in the middle.

Understand that the marketing team’s interest is in promoting the use of the tool because of how it will benefit the organization. They may not need to know the technical inner workings. But they should understand what the tool can do, and be aware of what it cannot do.

Honest and clear communication up-front will help smooth out the hype cycle and keep everyone interested longer. This way the crash from peak expectations to the trough of disillusionment is not so severe that the application is abandoned altogether.

Leadership team

These are the people that authorize spending and have the vision for how the application fits into the overall company strategy. They are driven by factors that you have no control over and you may not even be aware of. Be sure to provide them with the key information about your project so they can make informed decisions.

Photo by Adeolu Eletu on Unsplash

Depending on your role, you may or may not have direct interaction with executive leadership in your company. Your job is to summarize the costs and benefits associated with your project, even if that is just with your immediate supervisor who will pass this along.

Your costs will likely include:

  • Compute and storage — training and serving a model.
  • Image data collection — both real-world and synthetic or staged.
  • Hours per week — SME, MLOps, AI/ML engineering time.

Highlight the savings and/or value added:

  • Provide measures on speed and accuracy.
  • Translate efficiencies into FTE hours saved and customer satisfaction.
  • Bonus points if you can find a way to produce revenue.

Business leaders, much like the marketing team, may follow the hype cycle:

  • Be realistic about model performance. Don’t try to oversell it, but be honest about the opportunities for improvement.
  • Consider creating a human benchmark test to measure accuracy and speed for an SME. It is easy to say human accuracy is 95%, but it’s another thing to measure it.
  • Highlight short-term wins and how they can become long-term success.

Conclusion

I hope you can see that, beyond the technical challenges of creating an AI/ML application, there are many humans involved in a successful project. Being able to interact with these individuals, and meet them where they are in terms of their expectations from the technology, is vital to advancing the adoption of your application.

Photo by Vlad Hilitanu on Unsplash

Key takeaways:

  • Understand how your application fits into the business needs.
  • Practice communicating to a non-technical audience.
  • Collect measures of model performance and report these regularly to your stakeholders.
  • Expect that the hype cycle could help and hurt your cause, and that setting consistent and realistic expectations will ensure steady adoption.
  • Be aware that factors outside of your control, such as budgets and business strategy, could affect your project.

And most importantly…

Don’t let machines have all the fun learning!

Human nature gives us the curiosity we need to understand our world. Take every opportunity to grow and expand your skills, and remember that human interaction is at the heart of machine learning.

Closing remarks

Advancements in AI/ML have the potential (assuming they are properly developed) to do many tasks as well as humans. It would be a stretch to say “better than” humans because it can only be as good as the training data that humans provide. However, it is safe to say AI/ML can be faster than humans.

The next logical question would be, “Well, does that mean we can replace human workers?”

This is a delicate topic, and I want to be clear that I am not an advocate of eliminating jobs.

I see my role as an AI/ML Engineer as one that creates tools that aid in someone else’s job or enhance their ability to complete their work successfully. When used properly, the tools can validate difficult decisions and speed through repetitive tasks, allowing your experts to spend more time on the one-off situations that require more attention.

There may also be new career opportunities, from the care and feeding of data, to quality assessment and user experience, and even to new roles that leverage the technology in exciting and unexpected ways.

Unfortunately, business leaders may make decisions that impact people’s jobs, and this is completely out of your control. But all is not lost — even for us AI/ML Engineers…

There are things we can do

  • Be kind to the fellow human beings that we call “coworkers”.
  • Be aware of the fear and uncertainty that comes with technological advancements.
  • Be on the lookout for ways to help people leverage AI/ML in their careers and to make their lives better.

This is all part of being human.

The Basis of Cognitive Complexity: Teaching CNNs to See Connections

Transforming CNNs: From task-specific learning to abstract generalization


Liberating education consists in acts of cognition, not transferrals of information.

— Paulo Freire

One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing?

Many authors suggest that artificial intelligence models do not possess the same capabilities as humans, especially when it comes to plasticity, flexibility, and adaptation.

One of the aspects that models do not capture is the many causal relationships that govern the external world.

This article discusses these issues:

  • The parallelism between convolutional neural networks (CNNs) and the human visual cortex
  • Limitations of CNNs in understanding causal relations and learning abstract concepts
  • How to make CNNs learn simple causal relations

Is it the same? Is it different?

Convolutional networks (CNNs) [2] are multi-layered neural networks that take images as input and can be used for multiple tasks. One of the most fascinating aspects of CNNs is their inspiration from the human visual cortex [1]:

  • Hierarchical processing. The visual cortex processes images hierarchically, where early visual areas capture simple features (such as edges, lines, and colors) and deeper areas capture more complex features such as shapes, objects, and scenes. CNNs, thanks to their layered structure, capture edges and textures in the early layers, while deeper layers capture parts of objects or whole objects.
  • Receptive fields. Neurons in the visual cortex respond to stimuli in a specific local region of the visual field (commonly called receptive fields). As we go deeper, the receptive fields of the neurons widen, allowing more spatial information to be integrated. Thanks to pooling steps, the same happens in CNNs.
  • Feature sharing. Although biological neurons are not identical, similar features are recognized across different parts of the visual field. In CNNs, the various filters scan the entire image, allowing patterns to be recognized regardless of location.
  • Spatial invariance. Humans can recognize objects even when they are moved, scaled, or rotated. CNNs also possess this property.
The relationship between components of the visual system and CNN. Image source: here

These features have made CNNs perform well in visual tasks to the point of superhuman performance:

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well-trained on the validation images to be better aware of the existence of relevant classes. […] Our result (4.94%) exceeds the reported human-level performance. —source [3]

Although CNNs perform better than humans in several tasks, there are still cases where they fail spectacularly. For example, in a 2024 study [4], AI models failed to generalize image classification: state-of-the-art models perform better than humans for objects in upright poses but fail when objects are in unusual poses.

The correct label is shown above each object, and the AI’s wrong prediction is below. Image source: here

In conclusion, our results show that (1) humans are still much more robust than most networks at recognizing objects in unusual poses, (2) time is of the essence for such ability to emerge, and (3) even time-limited humans are dissimilar to deep neural networks. —source [4]

In the study [4], they note that humans need time to succeed in a task. Some tasks require not only visual recognition but also abstractive cognition, which requires time.

The generalization abilities that make humans capable come from understanding the laws that govern relations among objects. Humans recognize objects by extrapolating rules and chaining these rules to adapt to new situations. One of the simplest rules is the “same-different relation”: the ability to define whether two objects are the same or different. This ability develops rapidly during infancy and is also importantly associated with language development [5-7]. In addition, some animals such as ducks and chimpanzees also have it [8]. In contrast, learning same-different relations is very difficult for neural networks [9-10].

Example of a same-different task for a CNN. The network should return a label of 1 if the two objects are the same or a label of 0 if they are different. Image source: here

Convolutional networks show difficulty in learning this relationship. Likewise, they fail to learn other types of causal relationships that are simple for humans. Therefore, many researchers have concluded that CNNs lack the inductive bias necessary to be able to learn these relationships.

These negative results do not mean that neural networks are completely incapable of learning same-different relations. Much larger and longer trained models can learn this relation. For example, vision-transformer models pre-trained on ImageNet with contrastive learning can show this ability [12].

Can CNNs learn same-different relationships?

The fact that broad models can learn these kinds of relationships has rekindled interest in CNNs. The same-different relationship is considered among the basic logical operations that make up the foundations for higher-order cognition and reasoning. Showing that shallow CNNs can learn this concept would allow us to experiment with other relationships. Moreover, it will allow models to learn increasingly complex causal relationships. This is an important step in advancing the generalization capabilities of AI.

Previous work suggests that CNNs do not have the architectural inductive biases to be able to learn abstract visual relations. Other authors assume that the problem is in the training paradigm. In general, the classical gradient descent is used to learn a single task or a set of tasks. Given a task t or a set of tasks T, a loss function L is used to optimize the weights φ that should minimize the function L:

φ* = argmin_φ Σ_{t ∈ T} L_t(φ)

This can be viewed as simply the sum of the losses across different tasks (if we have more than one task). Instead, the Model-Agnostic Meta-Learning (MAML) algorithm [13] is designed to search for an optimal point in weight space for a set of related tasks. MAML seeks to find an initial set of weights θ that minimizes the loss function across tasks, facilitating rapid adaptation:

θ* = argmin_θ Σ_{t ∈ T} L_t(θ − α ∇_θ L_t(θ)), where α is the inner-loop (adaptation) learning rate

The difference may seem small, but conceptually, this approach is directed toward abstraction and generalization. If there are multiple tasks, traditional training tries to optimize weights for different tasks. MAML tries to identify a set of weights that is optimal for different tasks but at the same time equidistant in the weight space. This starting point θ allows the model to generalize more effectively across different tasks.

Meta-learning initial weights for generalization. Image source from here
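To make the difference concrete, here is a minimal first-order MAML sketch in PyTorch (it assumes PyTorch 2.x for torch.func.functional_call). The tiny model and the fabricated tasks are stand-ins for the shallow CNNs and the same-different dataset used in the study; what matters is the inner-adapt / outer-update structure.

import torch
import torch.nn as nn
from torch.func import functional_call   # available in PyTorch 2.x

# Toy stand-in model; the study uses shallow CNNs, but the meta-update logic is the same.
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def maml_meta_step(tasks, inner_lr=0.01):
    """One first-order MAML meta-update over a batch of tasks.
    Each task is a tuple (support_x, support_y, query_x, query_y)."""
    params = dict(model.named_parameters())    # the shared initialization theta
    meta_loss = 0.0
    for sx, sy, qx, qy in tasks:
        # Inner loop: adapt theta to this task using its support set.
        support_loss = loss_fn(functional_call(model, params, (sx,)), sy)
        grads = torch.autograd.grad(support_loss, list(params.values()))
        adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        # Outer loop: score the adapted weights on the task's query set.
        meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (qx,)), qy)
    meta_opt.zero_grad()
    meta_loss.backward()    # gradient of the post-adaptation loss with respect to theta
    meta_opt.step()
    return meta_loss.item()

# Fabricated binary tasks (8x8 single-channel "images") just to exercise the loop.
def fake_task(n=8):
    make = lambda: (torch.randn(n, 1, 8, 8), torch.randint(0, 2, (n, 1)).float())
    (sx, sy), (qx, qy) = make(), make()
    return sx, sy, qx, qy

print(maml_meta_step([fake_task() for _ in range(4)]))

The crucial design choice is that the meta-update optimizes θ for how well it performs after a small task-specific adaptation, rather than for any single task directly.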

Since we now have a method biased toward generalization and abstraction, we can test whether we can make CNNs learn the same-different relationship.

In this study [11], the authors compared shallow CNNs trained with classic gradient descent and with meta-learning on a dataset designed for this purpose. The dataset consists of 10 different tasks that test for the same-different relationship.

The Same-Different dataset. Image source from here

The authors [11] compare CNNs of 2, 4, or 6 layers trained in a traditional way or with meta-learning, showing several interesting results:

  1. The performance of traditional CNNs shows similar behavior to random guessing.
  2. Meta-learning significantly improves performance, suggesting that the model can learn the same-different relationship. A 2-layer CNN performs little better than chance, but by increasing the depth of the network, performance improves to near-perfect accuracy.
Comparison between traditional training and meta-learning for CNNs. Image source from here

One of the most intriguing results of [11] is that the model can be trained in a leave-one-out way (use 9 tasks and leave one out) and still show out-of-distribution generalization capabilities. Thus, the model has learned an abstract behavior that is rarely seen in such a small model (6 layers).

Out-of-distribution generalization for same-different classification. Image source from here

Conclusions

Although convolutional networks were inspired by how the human brain processes visual stimuli, they do not capture some of its basic capabilities. This is especially true when it comes to causal relations or abstract concepts. Some of these relationships can be learned only by large models with extensive training. This has led to the assumption that small CNNs cannot learn these relations due to a lack of architectural inductive bias. In recent years, efforts have been made to create new architectures that could have an advantage in learning relational reasoning. Yet most of these architectures fail to learn these kinds of relationships. Intriguingly, this can be overcome through the use of meta-learning.

The advantage of meta-learning is that it incentivizes more abstractive learning. Meta-learning pressures the model toward generalization by trying to optimize for all tasks at the same time. To do this, learning more abstract features is favored (low-level features, such as the angles of a particular shape, are not useful for generalization and are disfavored). Meta-learning allows a shallow CNN to learn abstract behavior that would otherwise require many more parameters and much more training.

Shallow CNNs and the same-different relationship serve as a model for higher cognitive functions. Meta-learning and other forms of training could be useful to improve the reasoning capabilities of models.

Another thing!

You can look for my other articles on Medium, and you can also connect or reach me on LinkedIn or in Bluesky. Check this repository, which contains weekly updated ML & AI news, or here for other tutorials and here for AI reviews. I am open to collaborations and projects, and you can reach me on LinkedIn.

Reference

Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.

  1. Lindsay, 2020, Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future, link
  2. Li, 2020, A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects, link
  3. He, 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, link
  4. Ollikka, 2024, A comparison between humans and AI at recognizing objects in unusual poses, link
  5. Premack, 1981, The codes of man and beasts, link
  6. Blote, 1999, Young children’s organizational strategies on a same–different task: A microgenetic study and a training study, link
  7. Lupker, 2015, Is there phonologically based priming in the same-different task? Evidence from Japanese-English bilinguals, link
  8. Gentner, 2021, Learning same and different relations: cross-species comparisons, link
  9. Kim, 2018, Not-so-clevr: learning same–different relations strains feedforward neural networks, link
  10. Puebla, 2021, Can deep convolutional neural networks support relational reasoning in the same-different task? link
  11. Gupta, 2025, Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation, link
  12. Tartaglini, 2023, Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations, link
  13. Finn, 2017, Model-agnostic meta-learning for fast adaptation of deep networks, link

The Invisible Revolution: How Vectors Are (Re)defining Business Success

The hidden force behind AI is powering the next wave of business transformation

In a world that focuses more on data, business leaders must understand vector thinking. At first, vectors may appear as complicated as algebra was in school, but they serve as a fundamental building block. Vectors are as essential as algebra for tasks like sharing a bill or computing interest. They underpin our digital systems for decision making, customer engagement, and data protection.

They represent a radically different concept of relationships and patterns. They do not simply divide data into rigid categories. Instead, they offer a dynamic, multidimensional view of the underlying connections. For two customers, “similar” may mean more than shared demographics or purchase histories. It’s their behaviors, preferences, and habits that distinctly align. Such associations can be defined and measured accurately in a vector space. But for many modern businesses, the logic is too complex, so leaders tend to fall back on old, learned, rule-based patterns instead. Back then, fraud detection, for example, still used simple rules on transaction limits. We’ve since evolved to recognize patterns and anomalies.

While it might have been common to block transactions that allocate 50% of your credit card limit at once just a few years ago, we are now able to analyze your retailer-specific spend history, look at average baskets of other customers at the very same retailers, and do some slight logic checks such as the physical location of your previous spends.

So a $7,000 transaction for McDonald’s in Dubai might just not happen if you just spent $3 on a bike rental in Amsterdam. Even $20 wouldn’t work since logical vector patterns can rule out the physical distance to be valid. Instead, the $7,000 transaction for your new E-Bike at a retailer near Amsterdam’s city center may just work flawlessly. Welcome to the insight of living in a world managed by vectors.

The danger of ignoring the paradigm of vectors is huge. Not mastering algebra can lead to bad financial decisions. Similarly, not knowing vectors can leave you vulnerable as a business leader. While the average customer may stay unaware of vectors as much as an average passenger in a plane is of aerodynamics, a business leader should be at least aware of what kerosene is and how many seats are to be occupied to break even for a specific flight. You may not need to fully understand the systems you rely on. A basic understanding helps to know when to reach out to the experts. And this is exactly my aim in this little journey into the world of vectors: become aware of the basic principles and know when to ask for more to better steer and manage your business.

In the hushed hallways of research labs and tech companies, a revolution was brewing. It would change how computers understood the world. This revolution has nothing to do with processing power or storage capacity. It was all about teaching machines to understand context, meaning, and nuance in words, using mathematical representations called vectors. Before we can appreciate the magnitude of this shift, we first need to understand how it differs from what came before.

Think about the way humans take in information. When we look at a cat, we don’t just process a checklist of components: whiskers, fur, four legs. Instead, our brains work through a network of relationships, contexts, and associations. We know a cat is more like a lion than a bicycle. It’s not from memorizing this fact. Our brains have naturally learned these relationships. Vector representations let computers consume content in a similarly human-like way. And we ought to understand how and why this is true. It’s as fundamental as knowing algebra in the time of an impending AI revolution.

In this brief jaunt in the vector realm, I will explain how vector-based computing works and why it’s so transformative. The code examples are only examples, so they are just for illustration and have no stand-alone functionality. You don’t have to be an engineer to understand those concepts. All you have to do is follow along, as I walk you through examples with plain language commentary explaining each one step by step, one step at a time. I don’t aim to be a world-class mathematician. I want to make vectors understandable to everyone: business leaders, managers, engineers, musicians, and others.


What are vectors, anyway?

Photo by Pete F on Unsplash

It is not that the vector-based computing journey started recently. Its roots go back to the 1950s with the development of distributed representations in cognitive science. James McClelland and David Rumelhart, among other researchers, theorized that the brain holds concepts not as individual entities. Instead, it holds them as the combined activity patterns of neural networks. This insight paved the way for contemporary vector representations.

The real breakthrough was three things coming together:
The exponential growth in computational power, the development of sophisticated neural network architectures, and the availability of massive datasets for training.

It is the combination of these elements that makes vector-based systems theoretically possible and practically implementable at scale. Mainstream AI as people have come to know it (with the likes of ChatGPT and others) is the direct consequence of this.

To better understand, let me put this in context: conventional computing systems work on symbols — discrete, human-readable symbols and rules. A traditional system, for instance, might represent a customer as a record:

customer = {
    'id': '12345',
    'age': 34,
    'purchase_history': ['electronics', 'books'],
    'risk_level': 'low'
}

This representation may be readable or logical, but it misses subtle patterns and relationships. In contrast, vector representations encode information within high-dimensional space where relationships arise naturally through geometric proximity. That same customer might be represented as a 384-dimensional vector where each one of these dimensions contributes to a rich, nuanced profile. Simple code allows for 2-Dimensional customer data to be transformed into vectors. Let’s take a look at how simple this just is:

from sentence_transformers import SentenceTransformer
import numpy as np

class CustomerVectorization:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        
    def create_customer_vector(self, customer_data):
        """
        Transform customer data into a rich vector representation
        that captures subtle patterns and relationships
        """
        # Combine various customer attributes into a meaningful text representation
        customer_text = f"""
        Customer profile: {customer_data['age']} year old,
        interested in {', '.join(customer_data['purchase_history'])},
        risk level: {customer_data['risk_level']}
        """
        
        # Generate base vector from text description
        base_vector = self.model.encode(customer_text)
        
        # Enrich vector with numerical features
        numerical_features = np.array([
            customer_data['age'] / 100,  # Normalized age
            len(customer_data['purchase_history']) / 10,  # Purchase history length
            self._risk_level_to_numeric(customer_data['risk_level'])
        ])
        
        # Combine text-based and numerical features
        combined_vector = np.concatenate([
            base_vector,
            numerical_features
        ])
        
        return combined_vector
    
    def _risk_level_to_numeric(self, risk_level):
        """Convert categorical risk level to normalized numeric value"""
        risk_mapping = {'low': 0.1, 'medium': 0.5, 'high': 0.9}
        return risk_mapping.get(risk_level.lower(), 0.5)

I trust that this code example has helped demonstrate how easily complex customer data can be encoded into meaningful vectors. The method seems complex at first, but it is simple. We merge text and numerical data on customers. This gives us rich, information-dense vectors that capture each customer’s essence. What I love most about this technique is its simplicity and flexibility. Similarly to how we encoded age, purchase history, and risk levels here, you could replicate this pattern to capture any other customer attributes relevant to your use case. Just recall the credit card spending patterns we described earlier. It’s similar data being turned into vectors, giving it a meaning far greater than it could ever have if it stayed 2-dimensional and were used for traditional rule-based logic.

What our little code example allows us to do is have two very suggestive representations, one in a semantically rich space and one in a normalized value space, mapping every record to a line in a graph that has direct comparison properties.
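For illustration, such a direct comparison could look like the following sketch, which reuses the CustomerVectorization class from above. The customer records are fabricated, and cosine similarity from scikit-learn is just one reasonable choice of comparison.

from sklearn.metrics.pairwise import cosine_similarity

vectorizer = CustomerVectorization()

customer_a = {'id': '12345', 'age': 34,
              'purchase_history': ['electronics', 'books'], 'risk_level': 'low'}
customer_b = {'id': '67890', 'age': 36,
              'purchase_history': ['electronics', 'gaming'], 'risk_level': 'low'}

vec_a = vectorizer.create_customer_vector(customer_a)
vec_b = vectorizer.create_customer_vector(customer_b)

# Values close to 1 mean the two customers occupy nearby regions of the vector space.
print(cosine_similarity([vec_a], [vec_b])[0][0])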

This allows the systems to identify complex patterns and relations that traditional data structures won’t be able to reflect adequately. With the geometric nature of vector spaces, the shape of these structures tells the stories of similarities, differences, and relationships, allowing for an inherently standardized yet flexible representation of complex data. 

But going from here, you will see this structure copied across other applications of vector-based customer analysis: use relevant data, aggregate it in a format we can work with, and let the combined meta-representation merge heterogeneous data into a common vector understanding. Whether it’s recommendation systems, customer segmentation models, or predictive analytics tools, this fundamental approach to thoughtful vectorization will underpin all of it. Thus, it is important to know and understand this approach even if you consider yourself non-technical and more on the business side.

Just keep in mind — the key is considering what part of your data has meaningful signals and how to encode them in a way that preserves their relationships. It is nothing but following your business logic in a way of thinking other than algebra. A more modern, multi-dimensional way.


The Mathematics of Meaning (Kings and Queens)

Photo by Debbie Fan on Unsplash

All human communication delivers rich networks of meaning that our brains wire to make sense of automatically. These are meanings that we can capture mathematically, using vector-based computing; we can represent words in space so that they are points in a multi-dimensional word space. This geometrical treatment allows us to think in spatial terms about the abstract semantic relations we are interested in, as distances and directions.

For instance, the relationship “King is to Queen as Man is to Woman” is encoded in a vector space in such a way that the direction and distance between the words “King” and “Queen” are similar to those between the words “Man” and “Woman.”

Let’s take a step back to understand why this might be: the key component that makes this system work is word embeddings — numerical representations that encode words as vectors in a dense vector space. These embeddings are derived from examining co-occurrences of words across large snippets of text. Just as we learn that “dog” and “puppy” are related concepts by observing that they occur in similar contexts, embedding algorithms learn to embed these words close to each other in a vector space.

Word embeddings reveal their real power when we look at how they encode analogical relationships. Think about what we know about the relationship between “king” and “queen.” We can tell through intuition that these words are different in gender but share associations related to the palace, authority, and leadership. Through a wonderful property of vector space systems — vector arithmetic — this relationship can be captured mathematically.

One does this beautifully in the classic example:

vector('king') - vector('man') + vector('woman') ≈ vector('queen')

This equation tells us that if we have the vector for “king,” and we subtract out the “man” vector (we remove the concept of “male”), and then we add the “woman” vector (we add the concept of “female”), we get a new point in space very close to that of “queen.” That’s not some mathematical coincidence — it’s based on how the embedding space has arranged the meaning in a sort of structured way.

We can apply this idea of context in Python with pre-trained word embeddings:

import gensim.downloader as api

# Load a pre-trained model that contains word vectors learned from Google News
model = api.load('word2vec-google-news-300')

# Define our analogy words
source_pair = ('king', 'man')
target_word = 'woman'

# Find which word completes the analogy using vector arithmetic
result = model.most_similar(
    positive=[target_word, source_pair[0]], 
    negative=[source_pair[1]], 
    topn=1
)

# Display the result
print(f"{source_pair[0]} is to {source_pair[1]} as {target_word} is to {result[0][0]}")

The structure of this vector space exposes many basic principles:

  1. Semantic similarity is present as spatial proximity. Related words congregate: the neighborhoods of ideas. “Dog,” “puppy,” and “canine” would be one such cluster; meanwhile, “cat,” “kitten,” and “feline” would create another cluster nearby.
  2. Relationships between words become directions in the space. The vector from “man” to “woman” encodes a gender relationship, and other such relationships (for example, “king” to “queen” or “actor” to “actress”) typically point in the same direction.
  3. The magnitude of vectors can carry meaning about word importance or specificity. Common words often have shorter vectors than specialized terms, reflecting their broader, less specific meanings.
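To see the first principle in action, we can reuse the word2vec model loaded in the earlier snippet and ask for nearest neighbors in the embedding space (the exact neighbors and scores depend on the pre-trained model, so treat the output as illustrative):

# Reusing the `model` object from the previous snippet (word2vec-google-news-300):
# words that sit close together in the vector space are semantically related.
for word, similarity in model.most_similar('puppy', topn=5):
    print(f"{word}: {similarity:.2f}")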

Working with relationships between words in this way gives us a geometric encoding of meaning and the mathematical precision needed to convey the nuances of natural language to machines. Instead of treating words as separate symbols, vector-based systems can recognize patterns, make analogies, and even uncover relationships that were never explicitly programmed.
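To make the first of these principles tangible, here is a small, optional check that reuses the pre-trained Google News model loaded earlier. Gensim’s similarity function returns the cosine similarity between two word vectors; the exact numbers depend on the model, so treat the comments as expectations rather than guarantees.

# Reusing the `model` object loaded above (word2vec-google-news-300)
print(model.similarity('dog', 'puppy'))   # same cluster: typically a high score
print(model.similarity('dog', 'kitten'))  # neighboring cluster: typically lower
print(model.similarity('dog', 'car'))     # unrelated concept: typically lower still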

To better grasp what was just discussed, I took the liberty of mapping the words we mentioned before (“King, Man, Woman”; “Dog, Puppy, Canine”; “Cat, Kitten, Feline”) to corresponding 2D vectors. These vectors numerically represent semantic meaning.

Visualization of the before-mentioned example terms as 2D word embeddings. Showing grouped categories for explanatory purposes. Data is fabricated and axes are simplified for educational purposes.
  • Human-related words have high positive values on both dimensions.
  • Dog-related words have negative x-values and positive y-values.
  • Cat-related words have positive x-values and negative y-values.

Be aware that those values are fabricated by me purely for illustration. Still, in the 2D space where the vectors are plotted, you can observe groups based on the positions of the dots representing the vectors: the three dog-related words, for example, cluster together as the “Dog” category, and the same goes for the other groups.

Grasping these basic principles gives us insight into both the capabilities and limitations of modern language AI, such as large language models (LLMs). Though these systems can perform amazing analogical and relational gymnastics, they are ultimately working with geometric patterns derived from the ways words appear in proximity to one another in a body of text: an elaborate but, by definition, partial reflection of human linguistic comprehension. Because an LLM is built on vectors, it can only generate output grounded in what it has received as input. That doesn’t mean it generates only what it has been trained on 1:1 (we all know about the fantastic hallucination capabilities of LLMs); it means that LLMs, unless specifically instructed, won’t coin neologisms or invent new language to describe things. This basic understanding is still lacking among many business leaders who expect LLMs to be miracle machines while knowing little about the underlying principles of vectors.


A Tale of Distances, Angles, and Dinner Parties

Photo by OurWhisky Foundation on Unsplash

Now, let’s assume you’re throwing a dinner party and it’s all about Hollywood and the big movies, and you want to seat people based on what they like. You could just calculate “distance” between their preferences (genres, perhaps even hobbies?) and find out who should sit together. But deciding how you measure that distance can be the difference between compelling conversations and annoyed participants. Or awkward silences. And yes, that company party flashback is repeating itself. Sorry for that!

The same is true in the world of vectors. The distance metric defines how “similar” two vectors look, and therefore, ultimately, how well your system predicts an outcome.

Euclidean Distance: Straightforward, but Limited

Euclidean distance measures the straight-line distance between two points in space, making it easy to understand:

  • Euclidean distance is fine as long as vectors are physical locations.
  • However, in high-dimensional spaces (like vectors representing user behavior or preferences), this metric often falls short. Differences in scale or magnitude can skew results, focusing on scale over actual similarity.

Example: Two vectors might represent your dinner guests’ genre preferences and how much streaming they do overall:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

While their preferences align, Euclidean distance would make them seem vastly different because of the disparity in overall activity.

But in higher-dimensional spaces, such as user behavior or textual meaning, Euclidean distance becomes increasingly less informative. It overweights magnitude, which can obscure comparisons. Consider two moviegoers: one has seen 200 action movies, the other has seen 10, but they both like the same genres. Because of the sheer difference in activity level, the second viewer would appear much less similar to the first under Euclidean distance, even though all either of them ever watched were Bruce Willis movies.

Cosine Similarity: Focused on Direction

The cosine similarity method takes a different approach. It focuses on the angle between vectors, not their magnitudes. It’s like comparing the direction of two arrows: if they point the same way, they are aligned, no matter their lengths. That makes it well suited for high-dimensional data, where we care about relationships, not scale.

  • If two vectors point in the same direction, they’re considered similar (cosine similarity approx of 1).
  • When opposing (so pointing in opposite directions), they differ (cosine similarity ≈ -1).
  • If they’re perpendicular (at a right angle of 90° to one another), they are unrelated (cosine similarity close to 0).

This normalizing property ensures that the similarity score correctly measures alignment, regardless of how one vector is scaled in comparison to another.

Example: Returning to our streaming preferences, let’s take a look at how our dinner guests’ preferences would look as vectors:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

Let’s discuss why cosine similarity is so effective in this case. When we compute the cosine similarity of vec1 [5, 10, 5] and vec2 [1, 2, 1], we are essentially measuring the angle between these vectors.

Before comparing directions, cosine similarity normalizes each vector, dividing every component by the vector’s length. This operation “cancels” the differences in magnitude:

  • For vec1, normalization gives us roughly [0.41, 0.82, 0.41].
  • For vec2, normalization gives us the very same values: roughly [0.41, 0.82, 0.41].

Now we can also see why these vectors are considered identical with regard to cosine similarity: their normalized versions are identical, so the cosine similarity works out to exactly 1.

This tells us that even though dinner guest A views more total content, the proportion they allocate to any given genre perfectly mirrors dinner guest B’s preferences. It’s like saying both your guests dedicate 25% of their time to action, 50% to drama, and 25% to comedy, no matter the total hours viewed.
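If you want to check this yourself, here is a minimal numpy sketch that reproduces the normalized vectors and the similarity score:

import numpy as np

vec1 = np.array([5, 10, 5])
vec2 = np.array([1, 2, 1])

# Normalize each vector to unit length
print(vec1 / np.linalg.norm(vec1))  # roughly [0.41, 0.82, 0.41]
print(vec2 / np.linalg.norm(vec2))  # roughly [0.41, 0.82, 0.41]

# Cosine similarity is the dot product of the unit-length vectors
cosine = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(cosine)  # exactly 1.0 for these two vectors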

It’s this normalization that makes cosine similarity particularly effective for high-dimensional data such as text embeddings or user preferences.

When dealing with data of many dimensions (think hundreds or thousands of vector components describing various features of a movie), it is often the relative weight of each dimension within the complete profile, rather than the absolute values, that matters most. Cosine similarity captures precisely this arrangement of relative importance, which makes it a powerful tool for identifying meaningful relationships in complex data.


Hiking up the Euclidian Mountain Trail

Photo by Christian Mikhael on Unsplash

In this part, we will see how different approaches to measuring similarity behave in practice, with a concrete real-world example and a little code. Even if you are a non-techie, the code will be easy to follow; it is there to illustrate the simplicity of it all. No fear!

How about we quickly discuss a 10-mile-long hiking trail? Two friends, Alex and Blake, write trail reviews of the same hike, but each ascribes it a different character:

The trail gained 2,000 feet in elevation over just 2 miles! Easily doable with some high spikes in between!
Alex

and

Beware, we hiked 100 straight feet up in the forest terrain at the spike! Overall, 10 beautiful miles of forest!
Blake

These descriptions can be represented as vectors:

alex_description = [2000, 2]  # [elevation_gain, trail_distance]
blake_description = [100, 10]  # [elevation_gain, trail_distance]

Let’s combine both similarity measures and see what it tells us:

import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Measures how similar the pattern or shape of two descriptions is,
    ignoring differences in scale. Returns 1.0 for perfectly aligned patterns.
    """
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

def euclidean_distance(vec1, vec2):
    """
    Measures the direct 'as-the-crow-flies' difference between descriptions.
    Smaller numbers mean descriptions are more similar.
    """
    return np.linalg.norm(np.array(vec1) - np.array(vec2))

# Alex focuses on the steep part: 2000ft elevation over 2 miles
alex_description = [2000, 2]  # [elevation_gain, trail_distance]

# Blake describes the whole trail: 100ft average elevation per mile over 10 miles
blake_description = [100, 10]  # [elevation_gain, trail_distance]

# Let's see how different these descriptions appear using each measure
print("Comparing how Alex and Blake described the same trail:")
print("\nEuclidean distance:", euclidean_distance(alex_description, blake_description))
print("(A larger number here suggests very different descriptions)")

print("\nCosine similarity:", cosine_similarity(alex_description, blake_description))
print("(A number close to 1.0 suggests similar patterns)")

# Let's also normalize the vectors to see what cosine similarity is looking at
alex_normalized = alex_description / np.linalg.norm(alex_description)
blake_normalized = blake_description / np.linalg.norm(blake_description)

print("\nAlex's normalized description:", alex_normalized)
print("Blake's normalized description:", blake_normalized)

So now, running this code, something magical happens:

Comparing how Alex and Blake described the same trail:

Euclidean distance: 1900.0168
(A larger number here suggests very different descriptions)

Cosine similarity: 0.9951
(A number close to 1.0 suggests similar patterns)

Alex's normalized description: [9.9999950e-01 9.9999950e-04]
Blake's normalized description: [0.99503719 0.09950372]

This output shows why, depending on what you are measuring, the same trail may appear different or similar.

The large Euclidean distance (about 1,900) suggests these are very different descriptions. That is understandable: 2,000 is a lot different from 100, and 2 is a lot different from 10. Euclidean distance takes the raw difference between these numbers without understanding their meaning.

But the high cosine similarity (about 0.995) tells us something more interesting: both descriptions capture a similar pattern.

If we look at the normalized vectors, we can see it too; both Alex and Blake are describing a trail in which elevation gain is the dominant feature. In each normalized vector, the first number (elevation gain) is much larger relative to the second (trail distance). Normalization compares the descriptions by proportion rather than by volume, and in proportion both descriptions share the same trait that defines the trail.

Perfectly true to life: Alex and Blake hiked the same trail but focused on different parts of it when writing their reviews. Alex zoomed in on the steepest section, a 2,000-foot climb over 2 miles, while Blake described the profile of the entire trail, roughly 100 feet of gain per mile over 10 miles. Cosine similarity identifies these descriptions as variations of the same basic trail pattern, whereas Euclidean distance regards them as completely different trails.

This example highlights the need to select the appropriate similarity measure. In real use cases, normalizing and taking the cosine similarity surfaces meaningful relationships that raw distance measures like Euclidean would miss.


Real-World Impacts of Metric Choices

Photo by fabio on Unsplash

The metric you pick doesn’t merely change the numbers; it influences the results of complex systems. Here’s how it breaks down in various domains:

  • In Recommendation Engines: When it comes to cosine similarity, we can group users who have the same tastes, even if they are doing different amounts of overall activity. A streaming service could use this to recommend movies that align with a user’s genre preferences, regardless of what is popular among a small subset of very active viewers.
  • In Document Retrieval: When querying a database of documents or research papers, cosine similarity ranks documents according to whether their content is similar in meaning to the user’s query, rather than their text length. This enables systems to retrieve results that are contextually relevant to the query, even though the documents are of a wide range of sizes.
  • In Fraud Detection: Patterns of behavior are often more important than pure numbers. Cosine similarity can be used to detect anomalies in spending habits, as it compares the direction of the transaction vectors — type of merchant, time of day, transaction amount, etc. — rather than the absolute magnitude.

And these differences matter because they shape how systems “think”. Let’s get back to that credit card example one more time: using Euclidean distance, the system might flag a high-value $7,000 transaction for your new e-bike as suspicious, even if that transaction is perfectly normal for you given your average spend of $20,000 a month.

A cosine-based system, on the other hand, understands that the transaction is consistent with what the user typically spends their money on, thus avoiding unnecessary false notifications.

But measures like Euclidean distance and cosine similarity are not merely theoretical. They’re the blueprints on which real-world systems stand. Whether it’s recommendation engines or fraud detection, the metrics we choose will directly impact how systems make sense of relationships in data.

Vector Representations in Practice: Industry Transformations

Photo by Louis Reed on Unsplash

This ability for abstraction is what makes vector representations so powerful — they transform complex and abstract field data into concepts that can be scored and actioned. These insights are catalyzing fundamental transformations in business processes, decision-making, and customer value delivery across sectors.

Next, let’s look at a concrete use case to see how vectors free up time to solve big problems and create new opportunities with real impact. I picked a single industry to show what vector-based approaches to a challenge can achieve, so here is a healthcare example from a clinical setting. Why healthcare? Because it matters to us all and is easier to relate to than digging into the depths of the finance system, insurance, renewable energy, or chemistry.

Healthcare Spotlight: Pattern Recognition in Complex Medical Data

The healthcare industry poses a perfect storm of challenges that vector representations can uniquely solve. Think of the complexities of patient data: medical histories, genetic information, lifestyle factors, and treatment outcomes all interact in nuanced ways that traditional rule-based systems are incapable of capturing.

At Massachusetts General Hospital, researchers implemented a vector-based early detection system for sepsis, a condition in which every hour of early detection increases the chances of survival by 7.6% (see the full study at pmc.ncbi.nlm.nih.gov/articles/PMC6166236/).

In this new methodology, spontaneous neutrophil velocity profiles (SVP) are used to describe the movement patterns of neutrophils from a drop of blood. We won’t get too medically detailed here, because we’re vector-focused today, but a neutrophil is an immune cell that acts as a kind of first responder in the body’s fight against infection.

The system then encodes each neutrophil’s motion as a vector that captures not just its magnitude (i.e., speed) but also its direction. By converting these biological patterns into high-dimensional vector spaces, the researchers could capture subtle differences and show that healthy individuals and sepsis patients exhibit statistically significant differences in movement. These numeric vectors were then processed by a machine learning model trained to detect early signs of sepsis. The result was a diagnostic tool with impressive sensitivity (97%) and specificity (98%) for rapid and accurate identification of this fatal condition, quite possibly relying on the cosine similarity we just learned about a moment ago (the paper doesn’t go into much detail, so this is pure speculation, but it would be the most suitable choice).

This is just one example of how medical data can be encoded into vector representations and turned into malleable, actionable insights. The approach made it possible to re-contextualize complex relationships and, together with machine learning, work around the limitations of previous diagnostic modalities, giving clinicians a potent tool to save lives. It’s a powerful reminder that vectors aren’t merely theoretical constructs: they’re practical, life-saving solutions that are powering the future of healthcare as much as your credit card risk detection software, and hopefully also your business.


Lead and understand, or face disruption. The naked truth.

Photo by Hunters Race on Unsplash

With all you have read by now, think of a decision as seemingly small as choosing the metric under which data relationships are evaluated. Leaders risk making assumptions that are subtle yet disastrous: you are using algebra as a tool, and while you get some result, you cannot know whether it is right. Making leadership decisions without understanding the fundamentals of vectors is like punching numbers into a calculator without knowing which formulas you are applying.

The good news is this doesn’t mean that business leaders have to become data scientists. Vectors are delightful because, once the core ideas have been grasped, they become very easy to work with. An understanding of a handful of concepts (for example, how vectors encode relationships, why distance metrics are important, and how embedding models function) can fundamentally change how you make high-level decisions. These tools will help you ask better questions, work with technical teams more effectively, and make sound decisions about the systems that will govern your business.

The returns on this small investment in comprehension are huge. There is much talk about personalization, yet few organizations use vector-based thinking in their business strategies. Doing so could help them leverage personalization to its full potential, delighting customers with tailored experiences and building loyalty. You could innovate in areas like fraud detection and operational efficiency, leveraging subtle patterns in data that traditional approaches miss, or perhaps even save lives, as described above. Equally important, you can avoid the expensive missteps that happen when leaders defer key decisions to others without understanding what those decisions mean.

The truth is, vectors are here now, driving the vast majority of the hyped AI technology behind the scenes and helping create the world we navigate in today and tomorrow. Companies that do not adapt their leadership to think in vectors risk falling behind in a competitive landscape that becomes ever more data-driven. Those who adopt this new paradigm will not just survive but prosper in an age of never-ending AI innovation.

Now is the moment to act. Start viewing the world through vectors: study their language, examine their principles, and ask how this way of thinking could change your tactics and your lodestars. Much in the way that algebra became an essential tool for working through practical life challenges, vectors will soon serve as the literacy of the data age. Actually, they already do. It is a future the prepared know how to take control of. The question is not whether vectors will define the next era of business; it is whether you are prepared to lead it.

The post The Invisible Revolution: How Vectors Are (Re)defining Business Success appeared first on Towards Data Science.

]]>
How to Measure Real Model Accuracy When Labels Are Noisy https://towardsdatascience.com/how-to-measure-real-model-accuracy-when-labels-are-noisy/ Thu, 10 Apr 2025 19:22:26 +0000 https://towardsdatascience.com/?p=605709 The math behind “true” accuracy and error correlation

The post How to Measure Real Model Accuracy When Labels Are Noisy appeared first on Towards Data Science.

]]>
Ground truth is never perfect. From scientific measurements to human annotations used to train deep learning models, ground truth always has some amount of errors. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in human annotations. How, then, can we evaluate predictive models using such erroneous labels?

In this article, we explore how to account for errors in test data labels and estimate a model’s “true” accuracy.

Example: image classification

Let’s say there are 100 images, each containing either a cat or a dog. The images are labeled by human annotators who are known to have 96% accuracy (Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ). If we train an image classifier on some of this data and find that it has 90% accuracy on a hold-out set (Aᵐᵒᵈᵉˡ), what is the “true” accuracy of the model (Aᵗʳᵘᵉ)? A couple of observations first:

  1. Within the 90% of predictions that the model got “right,” some examples may have been incorrectly labeled, meaning both the model and the ground truth are wrong. This artificially inflates the measured accuracy.
  2. Conversely, within the 10% of “incorrect” predictions, some may actually be cases where the model is right and the ground truth label is wrong. This artificially deflates the measured accuracy.

Given these complications, how much can the true accuracy vary?

Range of true accuracy

True accuracy of model for perfectly correlated and perfectly uncorrelated errors of model and label. Figure by author.

The true accuracy of our model depends on how its errors correlate with the errors in the ground truth labels. If our model’s errors perfectly overlap with the ground truth errors (i.e., the model is wrong in exactly the same way as human labelers), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 − (1 − 0.96) = 86%

Alternatively, if our model is wrong in exactly the opposite way as human labelers (perfect negative correlation), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 + (1 − 0.96) = 94%

Or more generally:

Aᵗʳᵘᵉ = Aᵐᵒᵈᵉˡ ± (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

It’s important to note that the model’s true accuracy can be both lower and higher than its reported accuracy, depending on the correlation between model errors and ground truth errors.

Probabilistic estimate of true accuracy

In some cases, inaccuracies among labels are randomly spread among the examples and not systematically biased toward certain labels or regions of the feature space. If the model’s inaccuracies are independent of the inaccuracies in the labels, we can derive a more precise estimate of its true accuracy.

When we measure Aᵐᵒᵈᵉˡ (90%), we’re counting cases where the model’s prediction matches the ground truth label. This can happen in two scenarios:

  1. Both model and ground truth are correct. This happens with probability Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ.
  2. Both model and ground truth are wrong (in the same way). This happens with probability (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ).

Under independence, we can express this as:

Aᵐᵒᵈᵉˡ = Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ + (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

Rearranging the terms, we get:

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ + Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1) / (2 × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1)

In our example, that equals (0.90 + 0.96 − 1) / (2 × 0.96 − 1) ≈ 93.5%, which is within the range of 86% to 94% that we derived above.
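If you prefer code over formulas, here is a small Python sketch of the calculations above (the helper function names are mine, not from any library):

def true_accuracy_bounds(a_model, a_ground_truth):
    """Lower and upper bounds on the model's true accuracy."""
    label_error = 1 - a_ground_truth
    return a_model - label_error, a_model + label_error

def true_accuracy_independent(a_model, a_ground_truth):
    """Estimate assuming model errors are independent of label errors."""
    return (a_model + a_ground_truth - 1) / (2 * a_ground_truth - 1)

print(true_accuracy_bounds(0.90, 0.96))        # (0.86, 0.94)
print(true_accuracy_independent(0.90, 0.96))   # ~0.935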

The independence paradox

Plugging in Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ as 0.96 from our example, we get

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ − 0.04) / 0.92. Let’s plot this below.

True accuracy as a function of model’s reported accuracy when ground truth accuracy = 96%. Figure by author.

Strange, isn’t it? If we assume that model’s errors are uncorrelated with ground truth errors, then its true accuracy Aᵗʳᵘᵉ is always higher than the 1:1 line when the reported accuracy is > 0.5. This holds true even if we vary Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ:

Model’s “true” accuracy as a function of its reported accuracy and ground truth accuracy. Figure by author.

Error correlation: why models often struggle where humans do

The independence assumption is crucial but often doesn’t hold in practice. If some images of cats are very blurry, or some small dogs look like cats, then both the ground truth and model errors are likely to be correlated. This causes Aᵗʳᵘᵉ to be closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)) than the upper bound.

More generally, model errors tend to be correlated with ground truth errors when:

  1. Both humans and models struggle with the same “difficult” examples (e.g., ambiguous images, edge cases)
  2. The model has learned the same biases present in the human labeling process
  3. Certain classes or examples are inherently ambiguous or challenging for any classifier, human or machine
  4. The labels themselves are generated from another model
  5. There are too many classes (and thus too many different ways of being wrong)

Best practices

The true accuracy of a model can differ significantly from its measured accuracy. Understanding this difference is crucial for proper model evaluation, especially in domains where obtaining perfect ground truth is impossible or prohibitively expensive.

When evaluating model performance with imperfect ground truth:

  1. Conduct targeted error analysis: Examine examples where the model disagrees with ground truth to identify potential ground truth errors.
  2. Consider the correlation between errors: If you suspect correlation between model and ground truth errors, the true accuracy is likely closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)).
  3. Obtain multiple independent annotations: Having multiple annotators can help estimate ground truth accuracy more reliably.

Conclusion

In summary, we learned that:

  1. The range of possible true accuracy depends on the error rate in the ground truth
  2. When errors are independent, the true accuracy is often higher than measured for models better than random chance
  3. In real-world scenarios, errors are rarely independent, and the true accuracy is likely closer to the lower bound

The post How to Measure Real Model Accuracy When Labels Are Noisy appeared first on Towards Data Science.

]]>
Ivory Tower Notes: The Problem https://towardsdatascience.com/ivory-tower-notes-the-problem/ Thu, 10 Apr 2025 18:48:08 +0000 https://towardsdatascience.com/?p=605707 When a data science problem is "the" problem

The post Ivory Tower Notes: The Problem appeared first on Towards Data Science.

]]>
Did you ever spend months on a Machine Learning project, only to discover you never defined the “correct” problem at the start? Whether you have, or you are only just starting out in the data science or AI field, welcome to my first Ivory Tower Note, where I will address this topic. 


The term “Ivory Tower” is a metaphor for a situation in which someone is isolated from the practical realities of everyday life. In academia, the term often refers to researchers who engage deeply in theoretical pursuits and remain distant from the realities that practitioners face outside academia.

As a former researcher, I wrote a short series of posts from my old Ivory Tower notes — the notes before the LLM era.

Scary, I know. I am writing this to manage expectations and the question, “Why ever did you do things this way?” — “Because no LLM told me how to do otherwise 10+ years ago.”

That’s why my notes contain “legacy” topics such as data mining, machine learning, multi-criteria decision-making, and (sometimes) human interactions, airplanes ✈ and art.

Nonetheless, whenever there is an opportunity, I will map my “old” knowledge to generative AI advances and explain how I applied it to datasets beyond the Ivory Tower.

Welcome to post #1…


How every Machine Learning and AI journey starts

 — It starts with a problem. 

For you, this is usually “the” problem because you need to live with it for months or, in the case of research, years.

With “the” problem, I am addressing the business problem you don’t fully understand or know how to solve at first. 

An even worse scenario is when you think you fully understand and know how to solve it quickly. This then creates only more problems that are again only yours to solve. But more about this in the upcoming sections. 

So, what’s “the” problem about?

Causa: It’s mostly about not managing or leveraging resources properly —  workforce, equipment, money, or time. 

Ratio: It’s usually about generating business value, which can span from improved accuracy, increased productivity, cost savings, revenue gains, faster reaction, decision, planning, delivery or turnaround times. 

Veritas: It’s always about finding a solution that relies and is hidden somewhere in the existing dataset. 

Or more than one dataset that someone labelled as “the one” and that is waiting for you to solve the problem. Because datasets follow, and are created from, technical or business process logs, “there has to be a solution lying somewhere within them.”

Ah, if only it were so easy.

To avoid wandering down a different chain of thought again, the point is that you will need to:

1 — Understand the problem fully,
2 — If not given, find the dataset “behind” it, and 
3 — Create a methodology to get to the solution that will generate business value from it. 

On this path, you will be tracked and measured, and time will not be on your side to deliver the solution that will solve “the universe equation.” 

That’s why you will need to approach the problem methodologically, drill down to smaller problems first, and focus entirely on them because they are the root cause of the overall problem. 

That’s why it’s good to learn how to…

Think like a Data Scientist.

Returning to the problem itself, let’s imagine that you are a tourist lost somewhere in the big museum, and you want to figure out where you are. What you do next is walk to the closest info map on the floor, which will show your current location. 

At this moment, in front of you, you see something like this: 

Data Science Process. Image by Author, inspired by Microsoft Learn

The next thing you might tell yourself is, “I want to get to Frida Kahlo’s painting.” (Note: These are the insights you want to get.)

Because your goal is to see this one painting that brought you miles away from your home and now sits two floors below, you head straight to the second floor. Beforehand, you memorized the shortest path to reach your goal. (Note: This is the initial data collection and discovery phase.)

However, along the way, you stumble upon some obstacles — the elevator is shut down for renovation, so you have to use the stairs. The museum paintings were reordered just two days ago, and the info plans didn’t reflect the changes, so the path you had in mind to get to the painting is not accurate. 

Then you find yourself wandering around the third floor already, asking quietly again, “How do I get out of this labyrinth and get to my painting faster?

While you don’t know the answer, you ask the museum staff on the third floor to help you out, and you start collecting the new data to get the correct route to your painting. (Note: This is a new data collection and discovery phase.)

Nonetheless, once you get to the second floor, you get lost again, but what you do next is start noticing a pattern in how the paintings have been ordered chronologically and thematically to group the artists whose styles overlap, thus giving you an indication of where to go to find your painting. (Note: This is a modelling phase overlapped with the enrichment phase from the dataset you collected during school days — your art knowledge.)

Finally, after adapting the pattern analysis and recalling the collected inputs on the museum route, you arrive in front of the painting you had been planning to see since booking your flight a few months ago. 

What I described now is how you approach data science and, nowadays, generative AI problems. You always start with the end goal in mind and ask yourself:

“What is the expected outcome I want or need to get from this?”

Then you start planning from this question backwards. The example above started with requesting holidays, booking flights, arranging accommodation, traveling to a destination, buying museum tickets, wandering around in a museum, and then seeing the painting you’ve been reading about for ages. 

Of course, there is more to it, and this process should be approached differently if you need to solve someone else’s problem, which is a bit more complex than locating the painting in the museum. 

In this case, you have to…

Ask the “good” questions.

To do this, let’s define what a good question means [1]: 

A good data science question must be concrete, tractable, and answerable. Your question works well if it naturally points to a feasible approach for your project. If your question is too vague to suggest what data you need, it won’t effectively guide your work.

Formulating good questions keeps you on track so you don’t get lost in the data that should be used to get to the specific problem solution, or you don’t end up solving the wrong problem.

Going into more detail, good questions will help identify gaps in reasoning, avoid faulty premises, and create alternative scenarios in case things do go south (which almost always happens)👇🏼.

Image created by Author after analyzing “Chapter 2. Setting goals by asking good questions” from “Think Like a Data Scientist” book [2]

From the above-presented diagram, you understand how good questions, first and foremost, need to support concrete assumptions. This means they need to be formulated in a way that your premises are clear and ensure they can be tested without mixing up facts with opinions.

Good questions produce answers that move you closer to your goal, whether through confirming hypotheses, providing new insights, or eliminating wrong paths. They are measurable, and with this, they connect to project goals because they are formulated with consideration of what’s possible, valuable, and efficient [2].

Good questions are answerable with available data, considering current data relevance and limitations. 

Last but not least, good questions anticipate obstacles. If something is certain in data science, this is the uncertainty, so having backup plans when things don’t work as expected is important to produce results for your project.

Let’s exemplify this with one use case of an airline company that has a challenge with increasing its fleet availability due to unplanned technical groundings (UTG).

These unexpected maintenance events disrupt flights and cost the company significant money. Because of this, executives decided to react to the problem and call in a data scientist (you) to help them improve aircraft availability.

Now, if this would be the first data science task you ever got, you would maybe start an investigation by asking:

“How can we eliminate all unplanned maintenance events?”

You understand how this question is an example of the wrong or “poor” one because:

  • It is not realistic: It lumps every possible defect, both small and big, into one impossible goal of “zero operational interruptions”.
  • It doesn’t hold a measure of success: There’s no concrete metric to show progress, and if you’re not at zero, you’re at “failure.”
  • It is not data-driven: The question didn’t cover which data is recorded before delays occur, and how the aircraft unavailability is measured and reported from it.

So, instead of this vague question, you would probably ask a set of targeted questions:

  1. Which aircraft (sub)system is most critical to flight disruptions?
    (Concrete, specific, answerable) This question narrows down your scope, focusing on only one or two specific (sub) systems affecting most delays.
  2. What constitutes “critical downtime” from an operational perspective?
    (Valuable, ties to business goals) If the airline (or regulatory body) doesn’t define how many minutes of unscheduled downtime matter for schedule disruptions, you might waste effort solving less urgent issues.
  3. Which data sources capture the root causes, and how can we fuse them?
    (Manageable, narrows the scope of the project further) This clarifies which data sources one would need to find the problem solution.

With these sharper questions, you will drill down to the real problem:

  • Not all delays weigh the same in cost or impact. The “correct” data science problem is to predict critical subsystem failures that lead to operationally costly interruptions so maintenance crews can prioritize them.

That’s why…

Defining the problem determines every step after. 

It’s the foundation upon which your data, modelling, and evaluation phases are built 👇🏼.

Image created by Author after analyzing and overlapping different images from “Chapter 2. Setting goals by asking good questions, Think Like a Data Scientist” book [2]

It means you are clarifying the project’s objectives, constraints, and scope; you need to articulate the ultimate goal first and, except for asking “What’s the expected outcome I want or need to get from this?”, ask as well: 

What would success look like and how can we measure it?

From there, drill down to (possible) next-level questions that you (I) have learned from the Ivory Tower days:
 — History questions: “Has anyone tried to solve this before? What happened? What is still missing?”
 —  Context questions: “Who is affected by this problem and how? How are they partially resolving it now? Which sources, methods, and tools are they using now, and can they still be reused in the new models?”
 — Impact Questions: “What happens if we don’t solve this? What changes if we do? Is there a value we can create by default? How much will this approach cost?”
 — Assumption Questions: “What are we taking for granted that might not be true (especially when it comes to data and stakeholders’ ideas)?”
 — ….

Then, do this in the loop and always “ask, ask again, and don’t stop asking” questions so you can drill down and understand which data and analysis are needed and what the ground problem is. 

This is the evergreen knowledge you can apply nowadays, too, when deciding if your problem is of a predictive or generative nature.

(More about this in some other note where I will explain how problematic it is trying to solve the problem with the models that have never seen — or have never been trained on — similar problems before.)

Now, going back to memory lane…

I want to add one important note: I have learned from late nights in the Ivory Tower that no amount of data or data science knowledge can save you if you’re solving the wrong problem and trying to get the solution (answer) from a question that was simply wrong and vague. 

When you have a problem on hand, do not rush into assumptions or building the models without understanding what you need to do (Festina lente)

In addition, prepare yourself for unexpected situations and do a proper investigation with your stakeholders and domain experts because their patience will be limited, too. 

With this, I want to say that the “real art” of being successful in data projects is knowing precisely what the problem is, figuring out if it can be solved in the first place, and then coming up with the “how” part. 

You get there by learning to ask good questions.

To end this narrative, recall how Einstein famously said:  

If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute solving it.


Thank you for reading, and stay tuned for the next Ivory Tower note.

If you found this post valuable, feel free to share it with your network. 👏

Connect for more stories on Medium ✍ and LinkedIn 🖇.


References: 

[1] DS4Humans, Backwards Design, accessed: April 5th 2025, https://ds4humans.com/40_in_practice/05_backwards_design.html#defining-a-good-question

[2] Godsey, B. (2017), Think Like a Data Scientist: Tackle the data science process step-by-step, Manning Publications.

The post Ivory Tower Notes: The Problem appeared first on Towards Data Science.

]]>
Why CatBoost Works So Well: The Engineering Behind the Magic https://towardsdatascience.com/catboost-inner-workings-and-optimizations/ Thu, 10 Apr 2025 00:28:11 +0000 https://towardsdatascience.com/?p=605702 CatBoost stands out by directly tackling a long-standing challenge in gradient boosting—how to handle categorical variables effectively without causing target leakage. By introducing innovative techniques such as Ordered Target Statistics and Ordered Boosting, and by leveraging the structure of Oblivious Trees, CatBoost efficiently balances robustness and accuracy. These methods ensure that each prediction uses only past data, preventing leakage and resulting in a model that is both fast and reliable for real-world tasks.

The post Why CatBoost Works So Well: The Engineering Behind the Magic appeared first on Towards Data Science.

]]>

Gradient boosting is a cornerstone technique for modeling tabular data due to its speed and simplicity. It delivers great results without any fuss. When you look around you’ll see multiple options like LightGBM, XGBoost, etc. Catboost is one such variant. In this post, we will take a detailed look at this model, explore its inner workings, and understand what makes it a great choice for real-world tasks.

Target Statistic

Table illustrating target encoding for categorical values. It maps vehicle types—Car, Bike, Bus, and Cycle—to numerical target means: 3.9, 1.2, 11.7, and 0.8 respectively. A curved arrow at the bottom indicates the transformation from category to numeric value
Target Encoding Example: the average value of the target variable for a category is used to replace each category. Image by author

One of the important contributions of the CatBoost paper is a new method of calculating the Target Statistic. What is a Target Statistic? If you have worked with categorical variables before, you’d know that the most rudimentary way to deal with categorical variables is to use one-hot encoding. From experience, you’d also know that this opens a can of worms: sparsity, the curse of dimensionality, memory issues, and so on, especially for categorical variables with high cardinality.

Greedy Target Statistic

To avoid one-hot encoding, we calculate the Target Statistic instead for the categorical variables. This means we calculate the mean of the target variable at each unique value of the categorical variable. So if a categorical variable takes the values A, B, and C, we calculate the average value of \(\text{y}\) for each of these categories and replace the category with its corresponding average of \(\text{y}\).

That sounds good, right? It does, but this approach comes with its problems, namely Target Leakage. To understand this, let’s take an extreme example. Extreme examples are often the easiest way to expose issues in an approach. Consider the below dataset:

Categorical Column | Target Column
A | 0
B | 1
C | 0
D | 1
E | 0

Greedy Target Statistic: Compute the mean target value for each unique category


Now let’s write the equation for calculating the Target Statistic:
\[\hat{x}^i_k = \frac{
\sum_{j=1}^{n} 1_{{x^i_j = x^i_k}} \cdot y_j + a p
}{
\sum_{j=1}^{n} 1_{{x^i_j = x^i_k}} + a
}\]

Here \(x^i_j\) is the value of the i-th categorical feature for the j-th sample. So for the k-th sample, we iterate over all samples of \(x^i\), select the ones having the value \(x^i_k\), and take the average value of \(y\) over those samples. Instead of taking a direct average, we take a smoothened average which is what the \(a\) and \(p\) terms are for. The \(a\) parameter is the smoothening parameter and \(p\) is the global mean of \(y\).

If we calculate the Target Statistic using the formula above, we get:

Categorical Column | Target Column | Target Statistic
A | 0 | \(\frac{ap}{1+a}\)
B | 1 | \(\frac{1+ap}{1+a}\)
C | 0 | \(\frac{ap}{1+a}\)
D | 1 | \(\frac{1+ap}{1+a}\)
E | 0 | \(\frac{ap}{1+a}\)

Calculation of Greedy Target Statistic with Smoothening


Now if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{0.5+ap}{1+a}\). Anything above this value will be classified as 1 and anything below will be classified as 0. I have a perfect classification at this point, so I get 100% accuracy on my training data.

Let’s take a look at the test data. Here, since we are assuming that the feature has all unique values, the Target Statistic becomes—
\[TS = \frac{0+ap}{0+a} = p\]
If \(threshold\) is greater than \(p\), all test data predictions will be \(0\). Conversely, if \(threshold\) is less than \(p\), all test data predictions will be \(1\) leading to poor performance on the test set.

Although we rarely see datasets where values of a categorical variable are all unique, we do see cases of high cardinality. This extreme example shows the pitfalls of using Greedy Target Statistic as an encoding approach.
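To make the greedy statistic concrete before moving on, here is a minimal pandas sketch of the formula above applied to the toy data (this is just the formula, not CatBoost’s internal implementation):

import pandas as pd

df = pd.DataFrame({'cat': ['A', 'B', 'C', 'D', 'E'],
                   'y':   [0, 1, 0, 1, 0]})

a = 1.0             # smoothing parameter
p = df['y'].mean()  # global mean of the target

# Greedy target statistic: smoothed mean of y per category, using all rows
stats = df.groupby('cat')['y'].agg(['sum', 'count'])
greedy_ts = (stats['sum'] + a * p) / (stats['count'] + a)
df['greedy_ts'] = df['cat'].map(greedy_ts)
print(df)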

Leave One Out Target Statistic

So the Greedy TS didn’t work out quite well for us. Let’s try another method— the Leave One Out Target Statistic method. At first glance, this looks promising. But, as it turns out, this too has its problems. Let’s see how with another extreme example. This time let’s assume that our categorical variable \(x^i\) has only one unique value, i.e., all values are the same. Consider the below data:

Categorical Column | Target Column
A | 0
A | 1
A | 0
A | 1

Example data for an extreme case where a categorical feature has just one unique value


If we calculate the leave-one-out target statistic, we get:

Categorical Column | Target Column | Target Statistic
A | 0 | \(\frac{n^+ - y_k + ap}{n+a}\)
A | 1 | \(\frac{n^+ - y_k + ap}{n+a}\)
A | 0 | \(\frac{n^+ - y_k + ap}{n+a}\)
A | 1 | \(\frac{n^+ - y_k + ap}{n+a}\)

Calculation of Leave One Out Target Statistic with Smoothening


Here:

  • \(n\) is the total number of samples in the data (in our case, this is 4)
  • \(n^+\) is the number of positive samples in the data (in our case, this is 2)
  • \(y_k\) is the value of the target column in that row

Substituting the above, we get:

Categorical Column | Target Column | Target Statistic
A | 0 | \(\frac{2 + ap}{4+a}\)
A | 1 | \(\frac{1 + ap}{4+a}\)
A | 0 | \(\frac{2 + ap}{4+a}\)
A | 1 | \(\frac{1 + ap}{4+a}\)

Substituting the values of n and n⁺


Now, if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{1.5+ap}{4+a}\). Anything above this value will be classified as 0 and anything below will be classified as 1. I have a perfect classification at this point, so I again get 100% accuracy on my training data.

You see the problem, right? My categorical variable, which has only a single unique value, produces different Target Statistic values that perform great on the training data but will fail miserably on the test data.

Ordered Target Statistic

Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples (Image by Author)
Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples. Image by author

CatBoost introduces a technique called Ordered Target Statistic to address the issues discussed above. This is the core principle of CatBoost’s handling of categorical variables.

This method, inspired by online learning, uses only past data to make predictions. CatBoost generates a random permutation (random ordering) of the training data (\(\sigma\)). To compute the Target Statistic for a sample at row \(k\), CatBoost uses samples from row \(1\) to \(k-1\). For the test data, it uses the entire train data to compute the statistic.

Additionally, CatBoost generates a new permutation for each tree, rather than reusing the same permutation each time. This reduces the variance that can arise in the early samples.
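A simplified Python sketch of the ordered statistic (one permutation and simple smoothing; a toy version of what CatBoost does internally) could look like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'cat': ['A', 'B', 'A', 'B', 'A'],
                   'y':   [0, 1, 0, 1, 0]})

a = 1.0             # smoothing parameter
p = df['y'].mean()  # global mean of the target

perm = np.random.default_rng(42).permutation(len(df))  # random ordering sigma
df_perm = df.iloc[perm].reset_index(drop=True)

sums, counts, ordered_ts = {}, {}, []
for _, row in df_perm.iterrows():
    c = row['cat']
    # Only rows seen *before* this one in the permutation are used
    ordered_ts.append((sums.get(c, 0.0) + a * p) / (counts.get(c, 0) + a))
    sums[c] = sums.get(c, 0.0) + row['y']
    counts[c] = counts.get(c, 0) + 1

df_perm['ordered_ts'] = ordered_ts
print(df_perm)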

Ordered Boosting

Diagram illustrating the ordered boosting mechanism in CatBoost. Data points x₁ through xᵢ are shown sequentially, with earlier samples used to compute predictions for later ones. Each xᵢ is associated with a model prediction M, where the prediction for xᵢ is computed using the model trained on previous data points. The equations show how residuals are calculated and how the model is updated: rᵗ(xᵢ, yᵢ) = yᵢ − M⁽ᵗ⁻¹⁾ᵢ⁻¹(xᵢ), and ΔM is learned from samples with order less than or equal to i. Final model update: Mᵢ = Mᵢ + ΔM.
This visualization shows how CatBoost computes residuals and updates the model: for sample xᵢ, the model predicts using only earlier data points. Source

Another important innovation introduced by the CatBoost paper is its use of Ordered Boosting. It builds on similar principles as ordered target statistics, where CatBoost randomly permutes the training data at the start of each tree and makes predictions sequentially.

In traditional boosting methods, when training tree \(t\), the model uses predictions from the previous tree \(t−1\) for all training samples, including the one it is currently predicting. This can lead to target leakage, as the model may indirectly use the label of the current sample during training.

To address this issue, CatBoost uses Ordered Boosting where, for a given sample, it only uses predictions from previous rows in the training data to calculate gradients and build trees. For each row \(i\) in the permutation, CatBoost calculates the output value of a leaf using only the samples before \(i\). The model uses this value to get the prediction for row \(i\). Thus, the model predicts each row without looking at its label.

CatBoost trains each tree using a new random permutation to average out the variance introduced by the early samples of any single permutation.
Let’s say we have 5 data points: A, B, C, D, E. CatBoost creates a random permutation of these points. Suppose the permutation is: σ = [C, A, E, B, D]

Step | Data Used to Train | Data Point Being Predicted | Notes
1 | (none) | C | No previous data → use prior
2 | C | A | Model trained on C only
3 | C, A | E | Model trained on C, A
4 | C, A, E | B | Model trained on C, A, E
5 | C, A, E, B | D | Model trained on C, A, E, B

Table highlighting how CatBoost uses random permutation to perform training

This avoids using the actual label of the current row to get the prediction thus preventing leakage.

Building a Tree

Each time CatBoost builds a tree, it creates a random permutation of the training data. It calculates the ordered target statistic for all the categorical variables with more than two unique values. For a binary categorical variable, it maps the values to zeros and ones.

CatBoost processes data as if the data is arriving sequentially. It begins with an initial prediction of zero for all instances, meaning the residuals are initially equivalent to the target values.

As training proceeds, CatBoost updates the leaf output for each sample using the residuals of the previous samples that fall into the same leaf. By not using the current sample’s label for prediction, CatBoost effectively prevents data leakage.

Split Candidates

Histogram showing how continuous features can be divided into bins—CatBoost evaluates splits using these binned values instead of raw continuous values
CatBoost bins continuous features to reduce the search space for optimal splits. Each bin edge and split point represents a potential decision threshold. Image by author

At the core of a decision tree lies the task of selecting the optimal feature and threshold for splitting a node. This involves evaluating multiple feature-threshold combinations and selecting the one that gives the best reduction in loss. CatBoost does something similar: it discretizes continuous variables into bins to simplify the search for the optimal combination, and then evaluates each of these feature-bin combinations to determine the best split.
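As a rough illustration of the idea, here is quantile-based binning of a continuous feature with numpy (CatBoost’s actual quantization is configurable and more involved; this is not its exact routine):

import numpy as np

values = np.random.default_rng(0).normal(size=1000)     # a continuous feature
bin_edges = np.quantile(values, np.linspace(0, 1, 9))   # 8 equal-frequency bins
bin_index = np.digitize(values, bin_edges[1:-1])        # bin id for each sample

print(np.round(bin_edges, 2))   # candidate split thresholds
print(np.bincount(bin_index))   # roughly 125 samples per bin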

A key difference from other tree implementations is that CatBoost uses Oblivious Trees, which apply the same split across all nodes at the same depth.

Oblivious Trees

Comparison between Oblivious Trees and Regular Trees. The Oblivious Tree on the left applies the same split condition at each level across all nodes, resulting in a symmetric structure. The Regular Tree on the right applies different conditions at each node, leading to an asymmetric structure with varied splits at different depths
Comparison of an Oblivious Tree and a regular decision tree: the Oblivious Tree applies the same split condition at every node of a given depth, while the regular tree can apply a different condition at each node. Image by author

Unlike standard decision trees, where different nodes can split on different conditions (feature-threshold), Oblivious Trees split across the same conditions across all nodes at the same depth of a tree. At a given depth, all samples are evaluated at the same feature-threshold combination. This symmetry has several implications:

  • Speed and simplicity: since the same condition is applied across all nodes at the same depth, the trees produced are simpler and faster to train
  • Regularization: since every node at a given depth is forced to apply the same condition, there is a regularization effect on the predictions
  • Parallelization: the uniformity of the split condition makes it easier to parallelize tree construction and to use GPUs to accelerate training

Conclusion

CatBoost stands out by directly tackling a long-standing challenge: how to handle categorical variables effectively without causing target leakage. Through innovations like Ordered Target Statistics, Ordered Boosting, and the use of Oblivious Trees, it efficiently balances robustness and accuracy.
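All of this machinery sits behind a compact API. As a rough usage sketch (the tiny dataset and parameter values below are made up purely to show the call pattern; cat_features tells CatBoost which columns to treat as categorical):

import pandas as pd
from catboost import CatBoostClassifier

# Made-up toy data: one categorical and one numeric feature
X = pd.DataFrame({'vehicle': ['Car', 'Bike', 'Bus', 'Car', 'Bike', 'Bus'],
                  'mileage': [3.9, 1.2, 11.7, 4.1, 0.8, 10.5]})
y = [1, 0, 1, 1, 0, 1]

model = CatBoostClassifier(iterations=50, depth=3, verbose=0)
model.fit(X, y, cat_features=['vehicle'])   # no manual encoding needed
print(model.predict_proba(X)[:, 1])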

If you found this deep dive helpful, you might enjoy another deep dive on the differences between Stochastic Gradient Classifier and Logistic Regression

Further Reading

The post Why CatBoost Works So Well: The Engineering Behind the Magic appeared first on Towards Data Science.

]]>
Mining Rules from Data https://towardsdatascience.com/mining-rules-from-data/ Wed, 09 Apr 2025 16:54:40 +0000 https://towardsdatascience.com/?p=605697 Using decision trees for quick segmentation

The post Mining Rules from Data appeared first on Towards Data Science.

]]>
Working with products, we might face a need to introduce some “rules”. Let me explain what I mean by “rules” in practical examples: 

  • Imagine that we’re seeing a massive wave of fraud in our product, and we want to restrict onboarding for a particular segment of customers to lower this risk. For example, we found out that the majority of fraudsters had specific user agents and IP addresses from certain countries. 
  • Another option is to send coupons to customers to use in our online shop. However, we would like to treat only customers who are likely to churn since loyal users will return to the product anyway. We might figure out that the most feasible group is customers who joined less than a year ago and decreased their spending by 30%+ last month. 
  • Transactional businesses often have a segment of customers where they are losing money. For example, a bank customer passed verification and regularly reached out to customer support (so generated onboarding and servicing costs) while doing almost no transactions (so not generating any revenue). The bank might introduce a small monthly subscription fee for customers with less than $1,000 in their account, since they are likely non-profitable.

Of course, in all these cases, we might have used a complex Machine Learning model that would take into account all the factors and predict the probability (either of a customer being a fraudster or churning). Still, under some circumstances, we might prefer just a set of static rules for the following reasons:  

  • The speed and complexity of implementation. Deploying an ML model in production takes time and effort. If you are experiencing a fraud wave right now, it might be more feasible to go live with a set of static rules that can be implemented quickly and then work on a comprehensive solution. 
  • Interpretability. ML models are black boxes. Even though we might be able to understand at a high level how they work and what features are the most important ones, it’s challenging to explain them to customers. In the example of subscription fees for non-profitable customers, it’s important to share a set of transparent rules with customers so that they can understand the pricing. 
  • Compliance. Some industries, like finance or healthcare, might require auditable and rule-based decisions to meet compliance requirements.

In this article, I want to show you how we can solve business problems using such rules. We will take a practical example and go really deep into this topic:

  • we will discuss which models we can use to mine such rules from data,
  • we will build a Decision Tree Classifier from scratch to learn how it works,
  • we will fit the sklearn Decision Tree Classifier model to extract the rules from the data,
  • we will learn how to parse the Decision Tree structure to get the resulting segments,
  • finally, we will explore different options for category encoding, since the sklearn implementation doesn’t support categorical variables.

We have lots of topics to cover, so let’s jump into it.

Case

As usual, it’s easier to learn something with a practical example. So, let’s start by discussing the task we will be solving in this article. 

We will work with the Bank Marketing dataset (CC BY 4.0 license). This dataset contains data about the direct marketing campaigns of a Portuguese banking institution. For each customer, we know a bunch of features and whether they subscribed to a term deposit (our target). 

Our business goal is to maximise the number of conversions (subscriptions) with limited operational resources. So, we can’t call the whole user base, and we want to reach the best outcome with the resources we have.

The first step is to look at the data. So, let’s load the data set.

import pandas as pd
pd.set_option('display.max_colwidth', 5000)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('bank-full.csv', sep = ';')
df = df.drop(['duration', 'campaign'], axis = 1)
# removed columns related to the current marketing campaign, 
# since they introduce data leakage

df.head()

We know quite a lot about the customers, including personal data (such as job type or marital status) and their previous behaviour (such as whether they have a loan or their average yearly balance).

Image by author

The next step is to select a machine-learning model. There are two classes of models that are usually used when we need something easily interpretable:

  • decision trees,
  • linear or logistic regression.

Both options are feasible and can give us good models that can be easily implemented and interpreted. However, in this article, I would like to stick to the decision tree model because it produces actual rules, while logistic regression gives us a probability derived from a weighted sum of features.

Data Preprocessing 

As we’ve seen in the data, there are lots of categorical variables (such as education or marital status). Unfortunately, the sklearn decision tree implementation can’t handle categorical data, so we need to do some preprocessing.

Let’s start by transforming yes/no flags into integers. 

for p in ['default', 'housing', 'loan', 'y']:
    df[p] = df[p].map(lambda x: 1 if x == 'yes' else 0)

The next step is to transform the month variable. We can use one-hot encoding for months, introducing flags like month_jan , month_feb , etc. However, there might be seasonal effects, and I think it would be more reasonable to convert months into integers following their order. 

month_map = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
# I saved 5 mins by asking ChatGPT to do this mapping

df['month'] = df.month.map(lambda x: month_map[x] if x in month_map else x)

For all other categorical variables, let’s use one-hot encoding. We will discuss different strategies for category encoding later, but for now, let’s stick to the default approach.

The easiest way to do one-hot encoding is to leverage the get_dummies function in pandas.

fin_df = pd.get_dummies(
  df, columns=['job', 'marital', 'education', 'poutcome', 'contact'], 
  dtype = int, # to convert to flags 0/1
  drop_first = False # to keep all possible values
)

This function transforms each categorical variable into a separate 1/0 column for each possible value. We can see how it works for the poutcome column. 

fin_df.merge(df[['id', 'poutcome']])\
    .groupby(['poutcome', 'poutcome_unknown', 'poutcome_failure', 
      'poutcome_other', 'poutcome_success'], as_index = False).y.count()\
    .rename(columns = {'y': 'cases'})\
    .sort_values('cases', ascending = False)
Image by author

Our data is now ready, and it’s time to discuss how decision tree classifiers work.

Decision Tree Classifier: Theory

In this section, we’ll explore the theory behind the Decision Tree Classifier and build the algorithm from scratch. If you’re more interested in a practical example, feel free to skip ahead to the next part.

The easiest way to understand the decision tree model is to look at an example. So, let's build a simple model based on our data. We will use DecisionTreeClassifier from sklearn.

import sklearn.tree

feature_names = fin_df.drop(['y'], axis = 1).columns
model = sklearn.tree.DecisionTreeClassifier(
  max_depth = 2, min_samples_leaf = 1000)
model.fit(fin_df[feature_names], fin_df['y'])

The next step is to visualise the tree.

import graphviz

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True, 
    proportion = True, precision = 2 
    # to show shares of classes instead of absolute numbers
)

graph = graphviz.Source(dot_data)
graph
Image by author

So, we can see that the model is straightforward. It’s a set of binary splits that we can use as heuristics. 

Let’s figure out how the classifier works under the hood. As usual, the best way to understand the model is to build the logic from scratch. 

The cornerstone of any optimisation problem is the objective function. By default, the decision tree classifier optimises the Gini coefficient. Imagine drawing one random item from the sample and then another. The Gini coefficient equals the probability that these two items come from different classes. So, our goal will be to minimise the Gini coefficient. 

In the case of just two classes (like in our example, where marketing intervention was either successful or not), the Gini coefficient is defined just by one parameter p , where p is the probability of getting an item from one of the classes. Here’s the formula:

\[\textbf{gini}(\textsf{p}) = 1 - \textsf{p}^2 - (1 - \textsf{p})^2 = 2 * \textsf{p} * (1 - \textsf{p}) \]

If our classification is ideal and we are able to separate the classes perfectly, then the Gini coefficient is equal to 0. The worst-case scenario is p = 0.5, where the Gini coefficient reaches its maximum value of 0.5.

With the formula above, we can calculate the Gini coefficient for each leaf of the tree. To calculate the Gini coefficient for the whole tree, we need to combine the Gini coefficients of binary splits. For that, we can just get a weighted sum:

\[\textbf{gini}_{\textsf{total}} = \textbf{gini}_{\textsf{left}} * \frac{\textbf{n}_{\textsf{left}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}} + \textbf{gini}_{\textsf{right}} * \frac{\textbf{n}_{\textsf{right}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}}\]

Now that we know what value we’re optimising, we only need to define all possible binary splits, iterate through them and choose the best option. 

Defining all possible binary splits is also quite straightforward. We can do it one by one for each parameter: sort the possible values and pick the thresholds halfway between neighbouring ones. For example, for month (an integer from 1 to 12), the candidate thresholds are 1.5, 2.5, and so on up to 11.5. 

Image by author

Let’s try to code it and see whether we will come to the same result. First, we will define functions that calculate the Gini coefficient for one dataset and the combination.

def get_gini(df):
    p = df.y.mean()
    return 2*p*(1-p)

print(get_gini(fin_df)) 
# 0.2065
# close to what we see at the root node of Decision Tree

def get_gini_comb(df1, df2):
    n1 = df1.shape[0]
    n2 = df2.shape[0]

    gini1 = get_gini(df1)
    gini2 = get_gini(df2)
    return (gini1*n1 + gini2*n2)/(n1 + n2)

The next step is to get all possible thresholds for one parameter and calculate their Gini coefficients. 

import tqdm
def optimise_one_parameter(df, param):
    tmp = []
    possible_values = list(sorted(df[param].unique()))
    print(param)

    for i in tqdm.tqdm(range(1, len(possible_values))): 
        threshold = (possible_values[i-1] + possible_values[i])/2
        gini = get_gini_comb(df[df[param] <= threshold], 
          df[df[param] > threshold])
        tmp.append(
            {'param': param, 
            'threshold': threshold, 
            'gini': gini, 
            'sizes': (df[df[param] <= threshold].shape[0], df[df[param] > threshold].shape[0])
            }
        )
    return pd.DataFrame(tmp)

The final step is to iterate through all features and calculate all possible splits. 

tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(fin_df, feature))
opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)
Image by author

Wonderful, we’ve got the same result as in our DecisionTreeClassifier model. The optimal split is whether poutcome = success or not. We’ve reduced the Gini coefficient from 0.2065 to 0.1872. 

To continue building the tree, we need to repeat the process recursively. For example, going down for the poutcome_success <= 0.5 branch:

tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(
      fin_df[fin_df.poutcome_success <= 0.5], feature))

opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)
Image by author

The only question we still need to discuss is the stopping criteria. In our initial example, we’ve used two conditions:

  • max_depth = 2 — it just limits the maximum depth of the tree, 
  • min_samples_leaf = 1000 prevents us from getting leaf nodes with fewer than 1K samples. Because of this condition, the model chose a binary split on contact_unknown even though age led to a lower Gini coefficient.

Also, I usually set min_impurity_decrease, which prevents us from splitting further if the gains are too small. By gains, we mean the decrease of the Gini coefficient.

So, we’ve understood how the Decision Tree Classifier works, and now it’s time to use it in practice.

If you’re interested in seeing how the Decision Tree Regressor works in full detail, you can look it up in my previous article.

Decision Trees: practice

We’ve already built a simple tree model with two layers, but it’s definitely not enough since it’s too simple to get all the insights from the data. Let’s train another Decision Tree, this time constraining only the minimum number of samples in leaves and the minimum impurity decrease (the reduction of the Gini coefficient). 

model = sklearn.tree.DecisionTreeClassifier(
  min_samples_leaf = 1000, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True, 
    proportion = True, precision=2, impurity = True)

graph = graphviz.Source(dot_data)

# saving graph to png file
png_bytes = graph.pipe(format='png')
with open('decision_tree.png','wb') as f:
    f.write(png_bytes)
Image by author

That’s it. We’ve got our rules to split customers into groups (leaves). Now, we can iterate through groups and see which groups of customers we want to contact. Even though our model is relatively small, it’s daunting to copy all conditions from the image. Luckily, we can parse the tree structure and get all the groups from the model.

The Decision Tree classifier has an attribute tree_ that will allow us to get access to low-level attributes of the tree, such as node_count .

n_nodes = model.tree_.node_count
print(n_nodes)
# 13

The tree_ variable also stores the entire tree structure as parallel arrays, where the ith element of each array stores the information about node i. For the root node, i equals 0.

Here are the arrays we have to represent the tree structure: 

  • children_left and children_right — IDs of left and right nodes, respectively; if the node is a leaf, then -1.
  • feature — feature used to split the node i .
  • threshold — threshold value used for the binary split of the node i .
  • n_node_samples — number of training samples that reached the node i .
  • values — shares of samples from each class.

Let’s save all these arrays. 

children_left = model.tree_.children_left
# [ 1,  2,  3,  4,  5,  6, -1, -1, -1, -1, -1, -1, -1]
children_right = model.tree_.children_right
# [12, 11, 10,  9,  8,  7, -1, -1, -1, -1, -1, -1, -1]
features = model.tree_.feature
# [30, 34,  0,  3,  6,  6, -2, -2, -2, -2, -2, -2, -2]
thresholds = model.tree_.threshold
# [ 0.5,  0.5, 59.5,  0.5,  6.5,  2.5, -2. , -2. , -2. , -2. , -2. , -2. , -2. ]
num_nodes = model.tree_.n_node_samples
# [45211, 43700, 30692, 29328, 14165,  4165,  2053,  2112, 10000, 
#  15163,  1364, 13008,  1511] 
values = model.tree_.value
# [[[0.8830152 , 0.1169848 ]],
# [[0.90135011, 0.09864989]],
# [[0.87671054, 0.12328946]],
# [[0.88550191, 0.11449809]],
# [[0.8530886 , 0.1469114 ]],
# [[0.76686675, 0.23313325]],
# [[0.87043351, 0.12956649]],
# [[0.66619318, 0.33380682]],
# [[0.889     , 0.111     ]],
# [[0.91578184, 0.08421816]],
# [[0.68768328, 0.31231672]],
# [[0.95948647, 0.04051353]],
# [[0.35274653, 0.64725347]]]

It will be more convenient for us to work with a hierarchical view of the tree structure, so let’s iterate through all nodes and, for each node, save the parent node ID and whether it was a right or left branch. 

hierarchy = {}

for node_id in range(n_nodes):
  if children_left[node_id] != -1: 
    hierarchy[children_left[node_id]] = {
      'parent': node_id, 
      'condition': 'left'
    }
  
  if children_right[node_id] != -1:
    hierarchy[children_right[node_id]] = {
      'parent': node_id, 
      'condition': 'right'
    }

print(hierarchy)
# {1: {'parent': 0, 'condition': 'left'},
# 12: {'parent': 0, 'condition': 'right'},
# 2: {'parent': 1, 'condition': 'left'},
# 11: {'parent': 1, 'condition': 'right'},
# 3: {'parent': 2, 'condition': 'left'},
# 10: {'parent': 2, 'condition': 'right'},
# 4: {'parent': 3, 'condition': 'left'},
# 9: {'parent': 3, 'condition': 'right'},
# 5: {'parent': 4, 'condition': 'left'},
# 8: {'parent': 4, 'condition': 'right'},
# 6: {'parent': 5, 'condition': 'left'},
# 7: {'parent': 5, 'condition': 'right'}}

The next step is to filter out the leaf nodes since they are terminal and the most interesting for us as they define the customer segments. 

leaves = []
for node_id in range(n_nodes):
    if (children_left[node_id] == -1) and (children_right[node_id] == -1):
        leaves.append(node_id)
print(leaves)
# [6, 7, 8, 9, 10, 11, 12]
leaves_df = pd.DataFrame({'node_id': leaves})

The next step is to determine all the conditions applied to each group since they will define our customer segments. The first function get_condition will give us the tuple of feature, condition type and threshold for a node. 

def get_condition(node_id, condition, features, thresholds, feature_names):
    # print(node_id, condition)
    feature = feature_names[features[node_id]]
    threshold = thresholds[node_id]
    cond = '>' if condition == 'right'  else '<='
    return (feature, cond, threshold)

print(get_condition(0, 'left', features, thresholds, feature_names)) 
# ('poutcome_success', '<=', 0.5)

print(get_condition(0, 'right', features, thresholds, feature_names))
# ('poutcome_success', '>', 0.5)

The next function will allow us to recursively go from the leaf node to the root and get all the binary splits. 

def get_decision_path_rec(node_id, decision_path, hierarchy):
  if node_id == 0:
    yield decision_path 
  else:
    parent_id = hierarchy[node_id]['parent']
    condition = hierarchy[node_id]['condition']
    for res in get_decision_path_rec(parent_id, decision_path + [(parent_id, condition)], hierarchy):
        yield res

decision_path = list(get_decision_path_rec(12, [], hierarchy))[0]
print(decision_path) 
# [(0, 'right')]

fmt_decision_path = list(map(
  lambda x: get_condition(x[0], x[1], features, thresholds, feature_names), 
  decision_path))
print(fmt_decision_path)
# [('poutcome_success', '>', 0.5)]

Let’s save the logic of executing the recursion and formatting into a wrapper function.

def get_decision_path(node_id, features, thresholds, hierarchy, feature_names):
  decision_path = list(get_decision_path_rec(node_id, [], hierarchy))[0]
  return list(map(lambda x: get_condition(x[0], x[1], features, thresholds, 
    feature_names), decision_path))

We’ve learned how to get each node’s binary split conditions. The only remaining logic is to combine the conditions. 

def get_decision_path_string(node_id, features, thresholds, hierarchy, 
  feature_names):
  conditions_df = pd.DataFrame(get_decision_path(node_id, features, thresholds, hierarchy, feature_names))
  conditions_df.columns = ['feature', 'condition', 'threshold']

  left_conditions_df = conditions_df[conditions_df.condition == '<=']
  right_conditions_df = conditions_df[conditions_df.condition == '>']

  # deduplication 
  left_conditions_df = left_conditions_df.groupby(['feature', 'condition'], as_index = False).min()
  right_conditions_df = right_conditions_df.groupby(['feature', 'condition'], as_index = False).max()
  
  # concatenation
  fin_conditions_df = pd.concat([left_conditions_df, right_conditions_df])\
      .sort_values(['feature', 'condition'], ascending = False)
  
  # formatting 
  fin_conditions_df['cond_string'] = list(map(
      lambda x, y, z: '(%s %s %.2f)' % (x, y, z),
      fin_conditions_df.feature,
      fin_conditions_df.condition,
      fin_conditions_df.threshold
  ))
  return ' and '.join(fin_conditions_df.cond_string.values)

print(get_decision_path_string(12, features, thresholds, hierarchy, 
  feature_names))
# (poutcome_success > 0.50)

Now, we can calculate the conditions for each group. 

leaves_df['condition'] = leaves_df['node_id'].map(
  lambda x: get_decision_path_string(x, features, thresholds, hierarchy, 
  feature_names)
)

The last step is to add the size and conversion rate of each group.

leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total)\
  .map(lambda x: int(round(x/100)))
leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()

Now, we can use these rules to make decisions. We can sort groups by conversion (probability of successful contact) and pick the customers with the highest probability. 

leaves_df.sort_values('conversion', ascending = False)\
  .drop('node_id', axis = 1).set_index('condition')
Image by author

Imagine we have the resources to contact only around 10% of our user base. In that case, we can focus on the first three groups. Even with such limited capacity, we would expect to get almost 40% conversion. That is a really good result, and we've achieved it with just a bunch of straightforward heuristics.  

In real life, it’s also worth testing the model (or heuristics) before deploying it in production. I would split the training dataset into training and validation parts (by time to avoid leakage) and check the heuristics’ performance on the validation set to get a better view of the actual model quality.

Working with high cardinality categories

Another topic that is worth discussing in this context is category encoding, since we have to encode the categorical variables for the sklearn implementation. We’ve used a straightforward approach with one-hot encoding, but in some cases, it doesn’t work.

Imagine we also have a region column in the data. I’ve synthetically generated English cities for each row. We have 155 unique regions, so the number of features has increased to 190. 

model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 100, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

So, the basic tree now has lots of conditions based on regions and it’s not convenient to work with them.

Image by author

In such a case, it might not be meaningful to explode the number of features, and it’s time to think about encoding. There’s a comprehensive article, “Categorically: Don’t explode — encode!”, that shares a bunch of different options to handle high cardinality categorical variables. I think the most feasible ones in our case will be the following two options:

  • Count or Frequency Encoder that shows good performance in benchmarks. This encoding assumes that categories of similar size would have similar characteristics. 
  • Target Encoder, where we can encode the category by the mean value of the target variable. It will allow us to prioritise segments with higher conversion and deprioritise segments with lower conversion. Ideally, it would be nice to use historical data to get the averages for the encoding, but we will use the existing dataset. 

However, it will be interesting to test different approaches, so let’s split our dataset into train and test, saving 10% for validation. For simplicity, I’ve used one-hot encoding for all columns except for region (since it has the highest cardinality).

from sklearn.model_selection import train_test_split
fin_df = pd.get_dummies(df, columns=['job', 'marital', 'education', 
  'poutcome', 'contact'], dtype = int, drop_first = False)
train_df, test_df = train_test_split(fin_df,test_size=0.1, random_state=42)
print(train_df.shape[0], test_df.shape[0])
# (40689, 4522)

For convenience, let’s combine all the logic for parsing the tree into one function.

def get_model_definition(model, feature_names):
  n_nodes = model.tree_.node_count
  children_left = model.tree_.children_left
  children_right = model.tree_.children_right
  features = model.tree_.feature
  thresholds = model.tree_.threshold
  num_nodes = model.tree_.n_node_samples
  values = model.tree_.value

  hierarchy = {}

  for node_id in range(n_nodes):
      if children_left[node_id] != -1: 
          hierarchy[children_left[node_id]] = {
            'parent': node_id, 
            'condition': 'left'
          }
    
      if children_right[node_id] != -1:
            hierarchy[children_right[node_id]] = {
             'parent': node_id, 
             'condition': 'right'
            }

  leaves = []
  for node_id in range(n_nodes):
      if (children_left[node_id] == -1) and (children_right[node_id] == -1):
          leaves.append(node_id)
  leaves_df = pd.DataFrame({'node_id': leaves})
  leaves_df['condition'] = leaves_df['node_id'].map(
    lambda x: get_decision_path_string(x, features, thresholds, hierarchy, feature_names)
  )

  leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
  leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
  leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total).map(lambda x: int(round(x/100)))
  leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
  leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()
  leaves_df = leaves_df.sort_values('conversion', ascending = False)\
    .drop('node_id', axis = 1).set_index('condition')
  leaves_df['cum_share_of_total'] = leaves_df['share_of_total'].cumsum()
  leaves_df['cum_share_of_converted'] = leaves_df['share_of_converted'].cumsum()
  return leaves_df

Let’s create an encodings data frame, calculating frequencies and conversions. 

region_encoding_df = train_df.groupby('region', as_index = False)\
  .aggregate({'id': 'count', 'y': 'mean'}).rename(columns = 
    {'id': 'region_count', 'y': 'region_target'})

Then, merge it into our training and validation sets. For the validation set, we will also fill NAs as averages.

train_df = train_df.merge(region_encoding_df, on = 'region')

test_df = test_df.merge(region_encoding_df, on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
  .fillna(region_encoding_df.region_target.mean())
test_df['region_count'] = test_df['region_count']\
  .fillna(region_encoding_df.region_count.mean())

Now, we can fit the models and get their structures.

count_feature_names = train_df.drop(
  ['y', 'id', 'region_target', 'region'], axis = 1).columns
target_feature_names = train_df.drop(
  ['y', 'id', 'region_count', 'region'], axis = 1).columns
print(len(count_feature_names), len(target_feature_names))
# (36, 36)

count_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
count_model.fit(train_df[count_feature_names], train_df['y'])

target_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
target_model.fit(train_df[target_feature_names], train_df['y'])

count_model_def_df = get_model_definition(count_model, count_feature_names)
target_model_def_df = get_model_definition(target_model, target_feature_names)

Let’s look at the structures and select the top categories up to 10–15% of our target audience. We can also apply these conditions to our validation sets to test our approach in practice. 

Let’s start with Count Encoder. 

Image by author
count_selected_df = test_df[
    (test_df.poutcome_success > 0.50) | 
    ((test_df.poutcome_success <= 0.50) & (test_df.age > 60.50)) | 
    ((test_df.region_count > 3645.50) & (test_df.region_count <= 8151.50) & 
         (test_df.poutcome_success <= 0.50) & (test_df.contact_cellular > 0.50) & (test_df.age <= 60.50))
]

print(count_selected_df.shape[0], count_selected_df.y.sum())
# (508, 227)

We can also see what regions have been selected, and it’s only Manchester.

Image by author

Let’s continue with the Target encoding. 

Image by author
target_selected_df = test_df[
    ((test_df.region_target > 0.21) & (test_df.poutcome_success > 0.50)) | 
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month <= 6.50) & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month > 8.50) & (test_df.housing <= 0.50) 
         & (test_df.contact_unknown <= 0.50)) |
    ((test_df.region_target <= 0.21) & (test_df.poutcome_success > 0.50)) |
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50) & (test_df.month > 6.50) & (test_df.month <= 8.50) 
         & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50))
]

print(target_selected_df.shape[0], target_selected_df.y.sum())
# (502, 248)

We see a slightly lower number of selected users for communication but a significantly higher number of conversions: 248 vs. 227 (+9.3%).

Let’s also look at the selected categories. We see that the model picked up all the cities with high conversions (Manchester, Liverpool, Bristol, Leicester, and Newcastle), but there are also many small regions with high conversions solely due to chance.

region_encoding_df[region_encoding_df.region_target > 0.21]\
  .sort_values('region_count', ascending = False)
Image by author

In our case, it doesn’t impact much since the share of such small cities is low. However, if you have way more small categories, you might see significant overfitting. Target Encoding can be tricky in such cases, so it’s worth keeping an eye on the output of your model. 

Luckily, there’s an approach that can help you overcome this issue. Following the article “Encoding Categorical Variables: A Deep Dive into Target Encoding”, we can add smoothing. The idea is to combine the group’s conversion rate with the overall average: the larger the group, the more weight its data carries, while smaller segments will lean more towards the global average.
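Written out explicitly (my notation, mirroring the code below), the blended estimate for a region with n customers is a sigmoid-weighted mix of the raw per-region conversion and the global average, where k and f are the parameters selected in the next step:

\[\textbf{w}(\textsf{n}) = \frac{1}{1 + e^{-(\textsf{n} - \textsf{k})/\textsf{f}}}, \qquad \textbf{target}_{\textsf{smoothed}} = \textbf{w} * \textbf{target}_{\textsf{region}} + (1 - \textbf{w}) * \textbf{target}_{\textsf{global}}\]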

First, I’ve selected the parameters that make sense for our distribution, looking at a bunch of options. I chose to use the global average for the groups under 100 people. This part is a bit subjective, so use common sense and your knowledge about the business domain.

import numpy as np
import matplotlib.pyplot as plt

global_mean = train_df.y.mean()

k = 100
f = 10
smooth_df = pd.DataFrame({'region_count':np.arange(1, 100001, 1) })
smooth_df['smoothing'] = (1 / (1 + np.exp(-(smooth_df.region_count - k) / f)))

ax = plt.scatter(smooth_df.region_count, smooth_df.smoothing)
plt.xscale('log')
plt.ylim([-.1, 1.1])
plt.title('Smoothing')
Image by author

Then, we can calculate, based on the selected parameters, the smoothing coefficients and blended averages.

region_encoding_df = region_encoding_df.rename(columns = {'region_target': 'raw_region_target'})
# keep the raw per-region conversion and overwrite region_target with the blended value
region_encoding_df['smoothing'] = (1 / (1 + np.exp(-(region_encoding_df.region_count - k) / f)))
region_encoding_df['region_target'] = region_encoding_df.smoothing * region_encoding_df.raw_region_target \
    + (1 - region_encoding_df.smoothing) * global_mean

Then, we can fit another model with smoothed target category encoding.

train_df = train_df.merge(region_encoding_df[['region', 'region_target']], 
  on = 'region')
test_df = test_df.merge(region_encoding_df[['region', 'region_target']], 
  on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
  .fillna(region_encoding_df.region_target.mean())

target_v2_feature_names = train_df.drop(['y', 'id', 'region'], axis = 1)\
  .columns

target_v2_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500, 
  min_impurity_decrease=0.001)
target_v2_model.fit(train_df[target_v2_feature_names], train_df['y'])
target_v2_model_def_df = get_model_definition(target_v2_model, 
  target_v2_feature_names)
Image by author
target_v2_selected_df = test_df[
    ((test_df.region_target > 0.12) & (test_df.poutcome_success > 0.50)) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month <= 6.50) & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month > 8.50) & (test_df.housing <= 0.50) 
         & (test_df.contact_unknown <= 0.50)) | 
    ((test_df.region_target <= 0.12) & (test_df.poutcome_success > 0.50) ) | 
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50) & (test_df.month > 6.50) & (test_df.month <= 8.50) 
         & (test_df.housing <= 0.50) & (test_df.contact_unknown <= 0.50) )
]

target_v2_selected_df.shape[0], target_v2_selected_df.y.sum()
# (500, 247)

We can see that we’ve eliminated the small cities and prevented overfitting in our model while keeping roughly the same performance, capturing 247 conversions.

region_encoding_df[region_encoding_df.region_target > 0.12]
Image by author

You can also use TargetEncoder from sklearn, which smoothes and mixes the category and global means depending on the segment size. However, it also adds random noise, which is not ideal for our case of heuristics.
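For reference, here is a minimal sketch (not part of the original pipeline) of how sklearn's TargetEncoder could be applied to the region column, reusing the train_df and test_df defined above; the cross-fitting on shuffled folds during fit_transform is where the randomness mentioned above comes from:

from sklearn.preprocessing import TargetEncoder  # available in scikit-learn >= 1.3

encoder = TargetEncoder(smooth = 'auto', random_state = 42)
# fit_transform uses cross-fitting on the training data
train_df['region_target_sk'] = encoder.fit_transform(
    train_df[['region']], train_df['y'])[:, 0]
# transform applies the encodings learned on the full training set
test_df['region_target_sk'] = encoder.transform(test_df[['region']])[:, 0]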

You can find the full code on GitHub.

Summary

In this article, we explored how to extract simple “rules” from data and use them to inform business decisions. We generated heuristics using a Decision Tree Classifier and touched on the important topic of categorical encoding since decision tree algorithms require categorical variables to be converted.

We saw that this rule-based approach can be surprisingly effective, helping you reach business decisions quickly. However, it’s worth noting that this simplistic approach has its drawbacks:

  • We are trading off the model’s power and accuracy for its simplicity and interpretability, so if you’re optimising for accuracy, choose another approach.
  • Even though we’re using a set of static heuristics, your data can still change, and the rules might become outdated, so you need to recheck your model from time to time. 

Thank you very much for reading this article. I hope it was insightful to you. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

Dataset: Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306

The post Mining Rules from Data appeared first on Towards Data Science.

]]>
A Data Scientist’s Guide to Docker Containers https://towardsdatascience.com/a-data-scientists-guide-to-docker-containers/ Tue, 08 Apr 2025 20:02:45 +0000 https://towardsdatascience.com/?p=605692 How to enable your ML model to run anywhere

The post A Data Scientist’s Guide to Docker Containers appeared first on Towards Data Science.

]]>
For an ML model to be useful, it needs to run somewhere. This somewhere is most likely not your local machine. A not-so-good model that runs in a production environment is better than a perfect model that never leaves your local machine.

However, the production machine is usually different from the one you developed the model on. So, you ship the model to the production machine, but somehow the model doesn’t work anymore. That’s weird, right? You tested everything on your local machine and it worked fine. You even wrote unit tests.

What happened? Most likely the production machine differs from your local machine. Perhaps it does not have all the needed dependencies installed to run your model. Perhaps the installed dependencies are on different versions. There can be many reasons for this.

How can you solve this problem? One approach could be to exactly replicate the production machine. But that is very inflexible as for each new production machine you would need to build a local replica.

A much nicer approach is to use Docker containers.

Docker is a tool that helps us to create, manage, and run code and applications in containers. A container is a small isolated computing environment in which we can package an application with all its dependencies. In our case our ML model with all the libraries it needs to run. With this, we do not need to rely on what is installed on the host machine. A Docker Container enables us to separate applications from the underlying infrastructure.

For example, we package our ML model locally and push it to the cloud. With this, Docker helps us to ensure that our model can run anywhere and anytime. Using Docker has several advantages for us. It helps us to deliver new models faster, improve reproducibility, and make collaboration easier. All because we have exactly the same dependencies no matter where we run the container.

As Docker is widely used in the industry Data Scientists need to be able to build and run containers using Docker. Hence, in this article, I will go through the basic concept of containers. I will show you all you need to know about Docker to get started. After we have covered the theory, I will show you how you can build and run your own Docker container.


What is a container?

A container is a small, isolated environment in which everything is self-contained. The environment packages up all code and dependencies.

A container has five main features.

  1. self-contained: A container isolates the application/software from its environment/infrastructure. Due to this isolation, we do not need to rely on any pre-installed dependencies on the host machine. Everything we need is part of the container. This ensures that the application can always run regardless of the infrastructure.
  2. isolated: The container has a minimal influence on the host and other containers and vice versa.
  3. independent: We can manage containers independently. Deleting a container does not affect other containers.
  4. portable: As a container isolates the software from the hardware, we can run it seamlessly on any machine. With this, we can move it between machines without a problem.
  5. lightweight: Containers are lightweight as they share the host machine’s OS. As they do not require their own OS, we do not need to partition the hardware resources of the host machine.

This might sound similar to virtual machines. But there is one big difference. The difference is in how they use their host computer’s resources. Virtual machines are an abstraction of the physical hardware. They partition one server into multiple. Thus, a VM includes a full copy of the OS which takes up more space.

In contrast, containers are an abstraction at the application layer. All containers share the host’s OS but run in isolated processes. Because containers do not contain an OS, they are more efficient in using the underlying system and resources by reducing overhead.

Containers vs. Virtual Machines (Image by the author based on docker.com)

Now we know what containers are. Let’s get some high-level understanding of how Docker works. I will briefly introduce the technical terms that are used often.


What is Docker?

To understand how Docker works, let’s have a brief look at its architecture.

Docker uses a client-server architecture containing three main parts: A Docker client, a Docker daemon (server), and a Docker registry.

The Docker client is the primary way to interact with Docker through commands. We use the client to communicate through a REST API with as many Docker daemons as we want. Often used commands are docker run, docker build, docker pull, and docker push. I will explain later what they do.

The Docker daemon manages Docker objects, such as images and containers. The daemon listens for Docker API requests. Depending on the request the daemon builds, runs, and distributes Docker containers. The Docker daemon and client can run on the same or different systems.

The Docker registry is a centralized location that stores and manages Docker images. We can use them to share images and make them accessible to others.

Sounds a bit abstract? No worries, once we get started it will be more intuitive. But before that, let’s run through the needed steps to create a Docker container.

Docker Architecture (Image by author based on docker.com)

What do we need to create a Docker container?

It is simple. We only need to do three steps:

  1. create a Dockerfile
  2. build a Docker Image from the Dockerfile
  3. run the Docker Image to create a Docker container

Let’s go step-by-step.

A Dockerfile is a text file that contains instructions on how to build a Docker Image. In the Dockerfile we define what the application looks like and its dependencies. We also state what process should run when launching the Docker container. Each instruction in the Dockerfile creates a layer of the image, representing a portion of the image’s file system. Each layer either adds, removes, or modifies the layer below it.

Based on the Dockerfile we create a Docker Image. The image is a read-only template with instructions to run a Docker container. Images are immutable. Once we create a Docker Image we cannot modify it anymore. If we want to make changes, we can only add changes on top of existing images or create a new image. When we rebuild an image, Docker is clever enough to rebuild only layers that have changed, reducing the build time.

A Docker Container is a runnable instance of a Docker Image. The container is defined by the image and any configuration options that we provide when creating or starting the container. When we remove a container all changes to its internal states are also removed if they are not stored in a persistent storage.


Using Docker: An example

With all the theory, let’s get our hands dirty and put everything together.

As an example, we will package a simple ML model with Flask in a Docker container. We can then run requests against the container and receive predictions in return. We will train a model locally and only load the artifacts of the trained model in the Docker Container.

I will go through the general workflow needed to create and run a Docker container with your ML model. I will guide you through the following steps:

  1. build model
  2. create requirements.txt file containing all dependencies
  3. create Dockerfile
  4. build docker image
  5. run container

Before we get started, we need to install Docker Desktop. We will use it to view and run our Docker containers later on. 

1. Build a model

First, we will train a simple RandomForestClassifier on scikit-learn’s Iris dataset and then store the trained model.
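The embedded training code isn't reproduced in this extract, so here is a minimal sketch of what this step could look like. The file names and the export to ONNX via skl2onnx are my assumptions, made because the serving script below creates an onnx session:

# train_model.py -- illustrative sketch, not the author's exact script
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Export the trained model to ONNX so the container only needs onnxruntime
onnx_model = convert_sklearn(
    model, initial_types=[('input', FloatTensorType([None, X.shape[1]]))])
with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())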

Second, we build a script making our model available through a Rest API, using Flask. The script is also simple and contains three main steps:

  1. extract and convert the data we want to pass into the model from the payload JSON
  2. load the model artifacts and create an onnx session and run the model
  3. return the model’s predictions as json

I took most of the code from here and here and made only minor changes.
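Since those gists aren't shown in this extract, below is a hedged sketch of what such an example.py could look like, following the three steps above. The endpoint name, port, and file names are assumptions chosen to match the rest of this article:

# example.py -- illustrative sketch, not the author's exact script
import numpy as np
import onnxruntime as rt
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the ONNX model artifact once at startup and create an inference session
session = rt.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name

@app.route('/invocations', methods=['POST'])
def invocations():
    # 1. extract and convert the features from the JSON payload
    payload = request.get_json()
    features = np.array(payload['data'], dtype=np.float32)

    # 2. run the onnx session to get predictions
    predictions = session.run(None, {input_name: features})[0]

    # 3. return the model's predictions as JSON
    return jsonify({'predictions': predictions.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)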

2. Create requirements

Once we have created the Python file we want to execute when the Docker container is running, we must create a requirements.txt file containing all dependencies. In our case, it looks like this:
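The actual file isn't included in this extract; given the sketch above, a plausible version would contain just the serving dependencies (pin exact versions in practice):

# requirements.txt (illustrative)
flask
numpy
onnxruntime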

3. Create Dockerfile

The last thing we need to prepare before being able to build a Docker Image and run a Docker container is to write a Dockerfile.

The Dockerfile contains all the instructions needed to build the Docker Image. The most common instructions are

  • FROM <image> — this specifies the base image that the build will extend.
  • WORKDIR <path> — this instruction specifies the “working directory” or the path in the image where files will be copied and commands will be executed.
  • COPY <host-path> <image-path> — this instruction tells the builder to copy files from the host and put them into the container image.
  • RUN <command> — this instruction tells the builder to run the specified command.
  • ENV <name> <value> — this instruction sets an environment variable that a running container will use.
  • EXPOSE <port-number> — this instruction sets the configuration on the image that indicates a port the image would like to expose.
  • USER <user-or-uid> — this instruction sets the default user for all subsequent instructions.
  • CMD ["<command>", "<arg1>"] — this instruction sets the default command a container using this image will run.

With these, we can create the Dockerfile for our example. We need to follow the following steps:

  1. Determine the base image
  2. Install application dependencies
  3. Copy in any relevant source code and/or binaries
  4. Configure the final image

Let’s go through them step by step. Each of these steps results in a layer in the Docker Image.

First, we specify the base image that we then build upon. As we have written in the example in Python, we will use a Python base image.

Second, we set the working directory into which we will copy all the files we need to be able to run our ML model.

Third, we refresh the package index files to ensure that we have the latest available information about packages and their versions.

Fourth, we copy in and install the application dependencies.

Fifth, we copy in the source code and all other files we need. Here, we also expose port 8080, which we will use for interacting with the ML model.

Sixth, we set a user so that the container does not run as the root user.

Seventh, we define that the example.py file will be executed when we run the Docker container. With this, we create the Flask server to run our requests against.
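Putting the seven steps together, a Dockerfile along these lines would work. Since the original file isn't reproduced in this extract, the base image tag, the user name, and the file names are assumptions:

# Dockerfile -- illustrative sketch following the steps above

# 1. base image
FROM python:3.10-slim

# 2. working directory inside the image
WORKDIR /app

# 3. refresh the package index files
RUN apt-get update

# 4. install application dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 5. copy in the source code and model artifacts, and expose the port for the Flask server
COPY example.py model.onnx ./
EXPOSE 8080

# 6. do not run as the root user
RUN useradd --create-home appuser
USER appuser

# 7. command executed when the container starts
CMD ["python", "example.py"]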

Besides creating the Dockerfile, we can also create a .dockerignore file to improve the build speed. Similar to a .gitignore file, we can exclude directories from the build context.

If you want to know more, please go to docker.com.

4. Create Docker Image

After creating these files, we have everything we need to build the Docker Image.

To build the image we first need to open Docker Desktop. You can check if Docker Desktop is running by running docker ps in the command line. This command shows you all running containers.

To build a Docker Image, we need to be at the same level as our Dockerfile and requirements.txt file. We can then run docker build -t our_first_image . The -t flag indicates the name of the image, i.e., our_first_image, and the . tells Docker to build from the current directory.

Once we built the image we can do several things. We can

  • view the image by running docker image ls
  • view the history or how the image was created by running docker image history <image_name>
  • push the image to a registry by running docker push <image_name>

5. Run Docker Container

Once we have built the Docker Image, we can run our ML model in a container.

For this, we only need to execute docker run -p 8080:8080 <image_name> in the command line. With -p 8080:8080 we connect the local port (8080) with the port in the container (8080).

If the Docker Image doesn’t expose a port, we could simply run docker run <image_name>. Instead of using the image_name, we can also use the image_id.

Okay, once the container is running, let’s run a request against it. For this, we will send a payload to the endpoint by running curl -X POST http://localhost:8080/invocations -H "Content-Type:application/json" -d @./path/to/sample_payload.json


Conclusion

In this article, I showed you the basics of Docker Containers, what they are, and how to build them yourself. Although I only scratched the surface, it should be enough to get you started and enable you to package your next model. With this knowledge, you should be able to avoid the “it works on my machine” problems.

I hope that you find this article useful and that it will help you become a better Data Scientist.

See you in my next article and/or leave a comment.

The post A Data Scientist’s Guide to Docker Containers appeared first on Towards Data Science.

]]>
How I Would Learn To Code (If I Could Start Over) https://towardsdatascience.com/how-i-would-learn-to-code-if-i-could-start-over/ Fri, 04 Apr 2025 18:43:36 +0000 https://towardsdatascience.com/?p=605424 How to learn to code in 2025

The post How I Would Learn To Code (If I Could Start Over) appeared first on Towards Data Science.

]]>
According to various sources, the average salary for coding jobs is ~£47.5k in the UK, which is ~35% higher than the median salary of about £35k.

So, coding is a very valuable skill that will earn you more money, not to mention it’s really fun.

I have been coding professionally now for 4 years, working as a data scientist and machine learning engineer and in this post, I will explain how I would learn to code if I had to do it all over again.

My journey

I still remember the time I wrote my first bit of code.
It was 9am on the first day of my physics undergrad, and we were in the computer lab.

The professor explained that computation is an integral part of modern physics as it allows us to run large-scale simulations of everything from subatomic particle collisions to the movement of galaxies.

It sounded amazing.

And the way we started this process was by going through a textbook to learn Fortran.

Yes, you heard that right.

My first programming language was Fortran, specifically Fortran 90.
I learned DO loops before FOR loops. I am definitely a rarity in this case.

In that first lab session, I remember writing “Hello World” as is the usual rite of passage and thinking, “Big woop.”

This is how you write “Hello World” in Fortran in case you are interested. 

program hello
print *, 'Hello World!'
end program hello

I actually really struggled to code in Fortran and didn’t do that well on tests we had, which put me off coding.

I still have some old coding projects in Fortran on my GitHub that you can check out.

Looking back, the learning curve to coding is quite steep, but it really does compound, and eventually, it will just click.

I didn’t realise this at the time and actively avoided programming modules in my physics degree, which I regret in hindsight as my progress would have been much quicker.

During my third year, I had to do a research placement as part of my master’s. The company I chose to work for/with used a graphical programming language called LabVIEW to run and manage their experiments.

LabVIEW is based on something called “G” and taught me to think of programming differently than script-based.

However, I haven’t used it since and probably never will, but it was cool to learn then.

I did enjoy the research year somewhat, but the pace at which research moves, at least in physics, is painfully slow. Nothing like the “heyday” from the early 20th century I envisioned.

One day after work, a video was recommended to me on my YouTube home page.

For those of you unaware, this was a documentary about DeepMind’s AI AlphaGo, which beat the best Go player in the world. Most people thought that an AI could never be good at Go.

From the video, I started to understand how AI worked and learn about neural networks, reinforcement learning, and deep learning.
I found it all so interesting, similar to physics research in the early 20th century.

Ultimately, this is when I started studying for a career in Data Science and machine learning, where I needed to teach myself Python and SQL.

This is where I so-called “fell in love” with coding.
I saw its real potential in actually solving problems, but the main thing was that I had a motivated reason to learn. I was studying to break into a career I wanted to be in, which really drove me.

I then became a data scientist for three years and am now a Machine Learning engineer. During this time, I worked extensively with Python and SQL.

Until a few months ago, those were the only programming languages I knew. I did learn other tools, such as Bash/Z shell, AWS, Docker, Databricks, Snowflake, etc., but not any other “proper” programming languages.

In my spare time, I dabbled a bit with C a couple of years ago, but I have forgotten virtually all of it now. I have some basic scripts on my GitHub if you are interested.

However, in my new role that I started a couple of months ago, I will be using Rust and GO, which I am very much looking forward to learning.

If you are interested in my entire journey to becoming a data scientist and machine learning engineer, you can read about it below:

Choose a language

I always recommend starting with a single language.

According to TestGorilla, there are over 8,000 programming languages, so how do you pick one?

Well, I would argue that many of these are useless for most jobs and have probably been developed as pet projects or for really niche cases.

You could choose your first language based on popularity. The Stack Overflow 2024 survey has great information on this. The most popular languages are JavaScript, Python, SQL, and Java.

However, the way I recommend you choose your first language should be based on what you want to do or work as.

  • Front-end web — JavaScript, HTML, CSS
  • Back-end web — Java, C#, Python, PHP or GO
  • iOS/macOS apps — Swift
  • Android apps — Kotlin or Java
  • Games — C++ or C
  • Embedded Systems — C or C++
  • Data science/machine learning / AI — Python and SQL

As I wanted to work in the AI/ML space, I focused my energy mainly on Python and some on SQL. It was probably a 90% / 10% split as SQL is smaller and easier to learn.

To this day, I still only know Python and SQL to a “professional” standard, but that’s fine, as pretty much the whole machine-learning community requires these languages.

This shows that you don’t need to know many languages; I have progressed quite far in my career, only knowing two to a significant depth. Of course, it would vary by sector, but the main point still stands.

So, pick a field you want to enter and choose the most in-demand and relevant language in that field.

Learn the bare minimum

The biggest mistake I see beginners make is getting stuck in “tutorial hell.”

This is where you take course after course but never branch out on your own.

I recommend taking a maximum of two courses on a language — literally any intro course would do — and then starting to build immediately.

And I literally mean, build your own projects and experience the struggle because that’s where learning is done.

You won’t know how to write functions until you do it yourself, you won’t know how to create classes until you do it yourself, and you literally won’t understand loops until you implement them yourself.

So, learn the bare minimum and immediately start experimenting; I promise it will at least 2x your learning curve.

You have probably heard this advice a lot, but in reality, it really is that simple. 

I always say that most things in life are simple but hard to do, especially in programming.

Avoid trends

When I say avoid trends, I don’t mean not to focus on areas that are doing well or in demand in the market.

What I am saying is that when you pick a certain language or specialism, stick with it.

Programming languages all share similar concepts and patterns, so when you learn one, you indirectly improve your ability to pick up another later.

But you still should focus on one language for at least a few months.

Don’t develop “shiny object syndrome” and chase the latest technologies; it’s a game that you will unfortunately lose.

There have been so many “distracting” technologies, such as blockchain, Web3, AI, the list goes on.

Instead, focus on the fundamentals:

  • Data types
  • Design patterns
  • Object-oriented programming
  • Data structures and algorithms
  • Problem-solving skills

These topics transcend individual programming languages and are much better to master than the latest Javascript framework!

It’s much better to have a strong understanding of one area than try to learn everything. Not only is this more manageable, but it is also better for your long-term career.

As I said earlier, I have progressed quite well in my career by only knowing Python and SQL, as I learned the required technologies for the field and didn’t get distracted.

I can’t stress how much leverage you will have in your career if you document your learning publicly.

Document your learning

I don’t know why more people don’t do this. Sharing what I have learned online has been the biggest game changer for my career.

Literally committing your code on GitHub is enough, but I really recommend posting on LinkedIn or X, and ideally, you should create blog posts to help you cement your understanding and show off your knowledge to employers.

When I interview candidates, if they have some sort of online presence showing their learnings, that’s immediately a tick in my box and an extra edge over other applicants.

It shows enthusiasm and passion, not to mention increasing your surface area of serendipity. 

I know many people are scared to do this, but you are suffering from the spotlight effect. Wikipedia defines this as:

The spotlight effect is the psychological phenomenon by which people tend to believe they are being noticed more than they really are.

In reality, no one cares whether you post online, or thinks about you even 1% as much as you think they do. 

So, start posting.

What about AI?

I could spend hours discussing why AI is not an immediate risk for anyone who wants to work in the coding profession.

You should embrace AI as part of your toolkit, but that’s as far as it will go, and it will definitely not replace programmers in 5 years.

Unless an AGI breakthrough suddenly occurs in the next decade, which is highly unlikely.

I personally doubt the answer to AGI is the cross-entropy loss function, which is what is used in most LLMs nowadays.

It has been shown time and time again that these AI models lack strong mathematical reasoning abilities, which is one of the most fundamental skills to being a good coder.

Even the so-called “software engineer killer” Devin is not as good as the creators initially marketed it. 

Most companies are simply trying to boost their investment by hyping AI, and their results are often over-exaggerated with controversial benchmark testing.

When I was building a website, ChatGPT even struggled with simple HTML and CSS, which you can argue is its bread and butter!

Overall, don’t worry about AI if you want to work as a coder; there are much, much bigger fish to fry before we cross that bridge!

NeetCode has done a great video explaining how current AI is incapable of replacing programmers.

Another thing!

Join my free newsletter, Dishing the Data, where I share weekly tips, insights, and advice from my experience as a practicing data scientist. Plus, as a subscriber, you’ll get my FREE Data Science Resume Template!

Connect with me

The post How I Would Learn To Code (If I Could Start Over) appeared first on Towards Data Science.

]]>