Data Science | Towards Data Science

The Basis of Cognitive Complexity: Teaching CNNs to See Connections

Salvatore Raieli — Fri, 11 Apr 2025 05:44:46 +0000

Liberating education consists in acts of cognition, not transferrals of information.
Paulo freire

One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing?

Many authors suggest that artificial intelligence models do not possess the same capabilities as humans, especially when it comes to plasticity, flexibility, and adaptation.

One of the aspects that models do not capture are several causal relationships about the external world.

This article discusses these issues:

The parallelism between convolutional neural networks (CNNs) and the human visual cortex
Limitations of CNNs in understanding causal relations and learning abstract concepts
How to make CNNs learn simple causal relations

Is it the same? Is it different?

Convolutional networks (CNNs) [2] are multi-layered neural networks that take images as input and can be used for multiple tasks. One of the most fascinating aspects of CNNs is their inspiration from the human visual cortex [1]:

Hierarchical processing. The visual cortex processes images hierarchically, where early visual areas capture simple features (such as edges, lines, and colors) and deeper areas capture more complex features such as shapes, objects, and scenes. CNN, due to its layered structure, captures edges and textures in the early layers, while layers further down capture parts or whole objects.
Receptive fields. Neurons in the visual cortex respond to stimuli in a specific local region of the visual field (commonly called receptive fields). As we go deeper, the receptive fields of the neurons widen, allowing more spatial information to be integrated. Thanks to pooling steps, the same happens in CNNs.
Feature sharing. Although biological neurons are not identical, similar features are recognized across different parts of the visual field. In CNNs, the various filters scan the entire image, allowing patterns to be recognized regardless of location.
Spatial invariance. Humans can recognize objects even when they are moved, scaled, or rotated. CNNs also possess this property.

The relationship between components of the visual system and CNN. Image source: here

These features have made CNNs perform well in visual tasks to the point of superhuman performance:

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well-trained on the validation images to be better aware of the existence of relevant classes. […] Our result (4.94%) exceeds the reported human-level performance. —source [3]

Although CNNs perform better than humans in several tasks, there are still cases where they fail spectacularly. For example, in a 2024 study [4], AI models failed to generalize image classification. State-of-the-art models perform better than humans for objects on upright poses but fail when objects are on unusual poses.

The right label is on the top of the object, and the AI wrong predicted label is below. Image source: here

In conclusion, our results show that (1) humans are still much more robust than most networks at recognizing objects in unusual poses, (2) time is of the essence for such ability to emerge, and (3) even time-limited humans are dissimilar to deep neural networks. —source [4]

In the study [4], they note that humans need time to succeed in a task. Some tasks require not only visual recognition but also abstractive cognition, which requires time.

The generalization abilities that make humans capable come from understanding the laws that govern relations among objects. Humans recognize objects by extrapolating rules and chaining these rules to adapt to new situations. One of the simplest rules is the “same-different relation”: the ability to define whether two objects are the same or different. This ability develops rapidly during infancy and is also importantly associated with language development [5-7]. In addition, some animals such as ducks and chimpanzees also have it [8]. In contrast, learning same-different relations is very difficult for neural networks [9-10].

Example of a same-different task for a CNN. The network should return a label of 1 if the two objects are the same or a label of 0 if they are different. Image source: here

Convolutional networks show difficulty in learning this relationship. Likewise, they fail to learn other types of causal relationships that are simple for humans. Therefore, many researchers have concluded that CNNs lack the inductive bias necessary to be able to learn these relationships.

These negative results do not mean that neural networks are completely incapable of learning same-different relations. Much larger and longer trained models can learn this relation. For example, vision-transformer models pre-trained on ImageNet with contrastive learning can show this ability [12].

Can CNNs learn same-different relationships?

The fact that broad models can learn these kinds of relationships has rekindled interest in CNNs. The same-different relationship is considered among the basic logical operations that make up the foundations for higher-order cognition and reasoning. Showing that shallow CNNs can learn this concept would allow us to experiment with other relationships. Moreover, it will allow models to learn increasingly complex causal relationships. This is an important step in advancing the generalization capabilities of AI.

Previous work suggests that CNNs do not have the architectural inductive biases to be able to learn abstract visual relations. Other authors assume that the problem is in the training paradigm. In general, the classical gradient descent is used to learn a single task or a set of tasks. Given a task t or a set of tasks T, a loss function L is used to optimize the weights φ that should minimize the function L:

Image source from here

This can be viewed as simply the sum of the losses across different tasks (if we have more than one task). Instead, the Model-Agnostic Meta-Learning (MAML) algorithm [13] is designed to search for an optimal point in weight space for a set of related tasks. MAML seeks to find an initial set of weights θ that minimizes the loss function across tasks, facilitating rapid adaptation:

Image source from here

The difference may seem small, but conceptually, this approach is directed toward abstraction and generalization. If there are multiple tasks, traditional training tries to optimize weights for different tasks. MAML tries to identify a set of weights that is optimal for different tasks but at the same time equidistant in the weight space. This starting point θ allows the model to generalize more effectively across different tasks.

Meta-learning initial weights for generalization. Image source from here

Since we now have a method biased toward generalization and abstraction, we can test whether we can make CNNs learn the same-different relationship.

In this study [11], they compared shallow CNNs trained with classic gradient descent and meta-learning on a dataset designed for this report. The dataset consists of 10 different tasks that test for the same-different relationship.

The Same-Different dataset. Image source from here

The authors [11] compare CNNs of 2, 4, or 6 layers trained in a traditional way or with meta-learning, showing several interesting results:

The performance of traditional CNNs shows similar behavior to random guessing.
Meta-learning significantly improves performance, suggesting that the model can learn the same-different relationship. A 2-layer CNN performs little better than chance, but by increasing the depth of the network, performance improves to near-perfect accuracy.

Comparison between traditional training and meta-learning for CNNs. Image source from here

One of the most intriguing results of [11] is that the model can be trained in a leave-one-out way (use 9 tasks and leave one out) and show out-of-distribution generalization capabilities. Thus, the model has learned abstracting behavior that is hardly seen in such a small model (6 layers).

out-of-distribution for same-different classification. Image source from here

Conclusions

Although convolutional networks were inspired by how the human brain processes visual stimuli, they do not capture some of its basic capabilities. This is especially true when it comes to causal relations or abstract concepts. Some of these relationships can be learned from large models only with extensive training. This has led to the assumption that small CNNs cannot learn these relations due to a lack of architecture inductive bias. In recent years, efforts have been made to create new architectures that could have an advantage in learning relational reasoning. Yet most of these architectures fail to learn these kinds of relationships. Intriguingly, this can be overcome through the use of meta-learning.

The advantage of meta-learning is to incentivize more abstractive learning. Meta-learning pressure toward generalization, trying to optimize for all tasks at the same time. To do this, learning more abstract features is favored (low-level features, such as the angles of a particular shape, are not useful for generalization and are disfavored). Meta-learning allows a shallow CNN to learn abstract behavior that would otherwise require many more parameters and training.

The shallow CNNs and same-different relationship are a model for higher cognitive functions. Meta-learning and different forms of training could be useful to improve the reasoning capabilities of the models.

Another thing!

You can look for my other articles on Medium, and you can also connect or reach me on LinkedIn or in Bluesky. Check this repository, which contains weekly updated ML & AI news, or here for other tutorials and here for AI reviews. I am open to collaborations and projects, and you can reach me on LinkedIn.

Reference

Here is the list of the principal references I consulted to write this article, only the first name for an article is cited.

Lindsay, 2020, Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future, link
Li, 2020, A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects, link
He, 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, link
Ollikka, 2024, A comparison between humans and AI at recognizing objects in unusual poses, link
Premark, 1981, The codes of man and beasts, link
Blote, 1999, Young children’s organizational strategies on a same–different task: A microgenetic study and a training study, link
Lupker, 2015, Is there phonologically based priming in the same-different task? Evidence from Japanese-English bilinguals, link
Gentner, 2021, Learning same and different relations: cross-species comparisons, link
Kim, 2018, Not-so-clevr: learning same–different relations strains feedforward neural networks, link
Puebla, 2021, Can deep convolutional neural networks support relational reasoning in the same-different task? link
Gupta, 2025, Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation, link
Tartaglini, 2023, Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations, link
Finn, 2017, Model-agnostic meta-learning for fast adaptation of deep networks, link

The post The Basis of Cognitive Complexity: Teaching CNNs to See Connections appeared first on Towards Data Science.

Why CatBoost Works So Well: The Engineering Behind the Magic

Shubham Gandhi — Thu, 10 Apr 2025 00:28:11 +0000

Gradient boosting is a cornerstone technique for modeling tabular data due to its speed and simplicity. It delivers great results without any fuss. When you look around you’ll see multiple options like LightGBM, XGBoost, etc. Catboost is one such variant. In this post, we will take a detailed look at this model, explore its inner workings, and understand what makes it a great choice for real-world tasks.

Target Statistic

Target Encoding Example: the average value of the target variable for a category is used to replace each category. Image by author

Target Encoding Example: the average value of the target variable for a category is used to replace each category

One of the important contributions of the CatBoost paper is a new method of calculating the Target Statistic. What is a Target Statistic? If you have worked with categorical variables before, you’d know that the most rudimentary way to deal with categorical variables is to use one-hot encoding. From experience, you’d also know that this introduces a can of problems like sparsity, curse of dimensionality, memory issues, etc. Especially for categorical variables with high cardinality.

Greedy Target Statistic

To avoid one-hot encoding, we calculate the Target Statistic instead for the categorical variables. This means we calculate the mean of the target variable at each unique value of the categorical variable. So if a categorical variable takes the values — A, B, C then we will calculate the average value of \(\text{y}\) over all these values and replace these values with the average of \(\text{y}\) at each unique value.

That sounds good, right? It does but this approach comes with its problems — namely Target Leakage. To understand this, let’s take an extreme example. Extreme examples are often the easiest way to eke out issues in the approach. Consider the below dataset:

Categorical Column	Target Column
A	0
B	1
C	0
D	1
E	0

Greedy Target Statistic: Compute the mean target value for each unique category

Now let’s write the equation for calculating the Target Statistic:
\[\hat{x}^i_k = \frac{
\sum_{j=1}^{n} 1_{{x^i_j = x^i_k}} \cdot y_j + a p
}{
\sum_{j=1}^{n} 1_{{x^i_j = x^i_k}} + a
}\]

Here \(x^i_j\) is the value of the i-th categorical feature for the j-th sample. So for the k-th sample, we iterate over all samples of \(x^i\), select the ones having the value \(x^i_k\), and take the average value of \(y\) over those samples. Instead of taking a direct average, we take a smoothened average which is what the \(a\) and \(p\) terms are for. The \(a\) parameter is the smoothening parameter and \(p\) is the global mean of \(y\).

If we calculate the Target Statistic using the formula above, we get:

Categorical Column	Target Column	Target Statistic
A	0	\(\frac{ap}{1+a}\)
B	1	\(\frac{1+ap}{1+a}\)
C	0	\(\frac{ap}{1+a}\)
D	1	\(\frac{1+ap}{1+a}\)
E	0	\(\frac{ap}{1+a}\)

Calculation of Greedy Target Statistic with Smoothening

Now if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{0.5+ap}{1+a}\). Anything above this value will be classified as 1 and anything below will be classified as 0. I have a perfect classification at this point, so I get 100% accuracy on my training data.

Let’s take a look at the test data. Here, since we are assuming that the feature has all unique values, the Target Statistic becomes—
\[TS = \frac{0+ap}{0+a} = p\]
If \(threshold\) is greater than \(p\), all test data predictions will be \(0\). Conversely, if \(threshold\) is less than \(p\), all test data predictions will be \(1\) leading to poor performance on the test set.

Although we rarely see datasets where values of a categorical variable are all unique, we do see cases of high cardinality. This extreme example shows the pitfalls of using Greedy Target Statistic as an encoding approach.

Leave One Out Target Statistic

So the Greedy TS didn’t work out quite well for us. Let’s try another method— the Leave One Out Target Statistic method. At first glance, this looks promising. But, as it turns out, this too has its problems. Let’s see how with another extreme example. This time let’s assume that our categorical variable \(x^i\) has only one unique value, i.e., all values are the same. Consider the below data:

Categorical Column	Target Column
A	0
A	1
A	0
A	1

Example data for an extreme case where a categorical feature has just one unique value

If calculate the leave one out target statistic, we get:

Categorical Column	Target Column	Target Statistic
A	0	\(\frac{n^+ -y_k + ap}{n+a}\)
A	1	\(\frac{n^+ -y_k + ap}{n+a}\)
A	0	\(\frac{n^+ -y_k + ap}{n+a}\)
A	1	\(\frac{n^+ -y_k + ap}{n+a}\)

Calculation of Leave One Out Target Statistic with Smoothening

Here:
\(n\) is the total samples in the data (in our case this 4)
\(n^+\) is the number of positive samples in the data (in our case this 2)
\(y_k\) is the value of the target column in that row
Substituting the above, we get:

Categorical Column	Target Column	Target Statistic
A	0	\(\frac{2 + ap}{4+a}\)
A	1	\(\frac{1 + ap}{4+a}\)
A	0	\(\frac{2 + ap}{4+a}\)
A	1	\(\frac{1 + ap}{4+a}\)

Substituing values of n and n⁺

Now, if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{1.5+ap}{4+a}\). Anything above this value will be classified as 0 and anything below will be classified as 1. I have a perfect classification at this point, so I again get 100% accuracy on my training data.

You see the problem, right? My categorical variable which doesn’t have more than a unique value is producing different values for Target Statistic which will perform great on the training data but will fail miserably on the test data.

Ordered Target Statistic

Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples. Image by author

CatBoost introduces a technique called Ordered Target Statistic to address the issues discussed above. This is the core principle of CatBoost’s handling of categorical variables.

This method, inspired by online learning, uses only past data to make predictions. CatBoost generates a random permutation (random ordering) of the training data(\(\sigma\)). To compute the Target Statistic for a sample at row \(k\), CatBoost uses samples from row \(1\) to \(k-1\). For the test data, it uses the entire train data to compute the statistic.

Additionally, CatBoost generates a new permutation for each tree, rather than reusing the same permutation each time. This reduces the variance that can arise in the early samples.

Ordered Boosting

This visualization shows how CatBoost computes residuals and updates the model: for sample xᵢ, the model predicts using only earlier data points. Source

Another important innovation introduced by the CatBoost paper is its use of Ordered Boosting. It builds on similar principles as ordered target statistics, where CatBoost randomly permutes the training data at the start of each tree and makes predictions sequentially.

In traditional boosting methods, when training tree \(t\), the model uses predictions from the previous tree \(t−1\) for all training samples, including the one it is currently predicting. This can lead to target leakage, as the model may indirectly use the label of the current sample during training.

To address this issue, CatBoost uses Ordered Boosting where, for a given sample, it only uses predictions from previous rows in the training data to calculate gradients and build trees. For each row \(i\) in the permutation, CatBoost calculates the output value of a leaf using only the samples before \(i\). The model uses this value to get the prediction for row \(i\). Thus, the model predicts each row without looking at its label.

CatBoost trains each tree using a new random permutation to average the variance in early samples in one permutation.
Let’s say we have 5 data points: A, B, C, D, E. CatBoost creates a random permutation of these points. Suppose the permutation is: σ = [C, A, E, B, D]

Step	Data Used to Train	Data Point Being Predicted	Notes
1	—	C	No previous data → use prior
2	C	A	Model trained on C only
3	C, A	E	Model trained on C, A
4	C, A, E	B	Model trained on C, A, E
5	C, A, E, B	D	Model trained on C, A, E, B

Table highlighting how CatBoost uses random permutation to perform training

This avoids using the actual label of the current row to get the prediction thus preventing leakage.

Building a Tree

Each time CatBoost builds a tree, it creates a random permutation of the training data. It calculates the ordered target statistic for all the categorical variables with more than two unique values. For a binary categorical variable, it maps the values to zeros and ones.

CatBoost processes data as if the data is arriving sequentially. It begins with an initial prediction of zero for all instances, meaning the residuals are initially equivalent to the target values.

As training proceeds, CatBoost updates the leaf output for each sample using the residuals of the previous samples that fall into the same leaf. By not using the current sample’s label for prediction, CatBoost effectively prevents data leakage.

Split Candidates

CatBoost bins continuous features to reduce the search space for optimal splits. Each bin edge and split point represents a potential decision threshold. Image by author

At the core of a decision tree lies the task of selecting the optimal feature and threshold for splitting a node. This involves evaluating multiple feature-threshold combinations and selecting the one that gives the best reduction in loss. CatBoost does something similar. It discretizes the continuous variable into bins to simplify the search for the optimal combination. It evaluates each of these feature-bin combinations to determine the best split

CatBoost uses Oblivious Trees, a key difference compared to other trees, where it uses the same split across all nodes at the same depth.

Oblivious Trees

Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples. Image by author

Unlike standard decision trees, where different nodes can split on different conditions (feature-threshold), Oblivious Trees split across the same conditions across all nodes at the same depth of a tree. At a given depth, all samples are evaluated at the same feature-threshold combination. This symmetry has several implications:

Speed and simplicity: since the same condition is applied across all nodes at the same depth, the trees produced are simpler and faster to train
Regularization: Since all trees are forced to apply the same condition across the tree at the same depth, there is a regularization effect on the predictions
Parallelization: the uniformity of the split condition, makes it easier to parallelize the tree creation and usage of GPU to accelerate training

Conclusion

CatBoost stands out by directly tackling a long-standing challenge: how to handle categorical variables effectively without causing target leakage. Through innovations like Ordered Target Statistics, Ordered Boosting, and the use of Oblivious Trees, it efficiently balances robustness and accuracy.

If you found this deep dive helpful, you might enjoy another deep dive on the differences between Stochastic Gradient Classifer and Logistic Regression

A Data Scientist’s Guide to Docker Containers

Jonte Dancker — Tue, 08 Apr 2025 20:02:45 +0000

For a ML model to be useful it needs to run somewhere. This somewhere is most likely not your local machine. A not-so-good model that runs in a production environment is better than a perfect model that never leaves your local machine.

However, the production machine is usually different from the one you developed the model on. So, you ship the model to the production machine, but somehow the model doesn’t work anymore. That’s weird, right? You tested everything on your local machine and it worked fine. You even wrote unit tests.

What happened? Most likely the production machine differs from your local machine. Perhaps it does not have all the needed dependencies installed to run your model. Perhaps installed dependencies are on a different version. There can be many reasons for this.

How can you solve this problem? One approach could be to exactly replicate the production machine. But that is very inflexible as for each new production machine you would need to build a local replica.

A much nicer approach is to use Docker containers.

Docker is a tool that helps us to create, manage, and run code and applications in containers. A container is a small isolated computing environment in which we can package an application with all its dependencies. In our case our ML model with all the libraries it needs to run. With this, we do not need to rely on what is installed on the host machine. A Docker Container enables us to separate applications from the underlying infrastructure.

For example, we package our ML model locally and push it to the cloud. With this, Docker helps us to ensure that our model can run anywhere and anytime. Using Docker has several advantages for us. It helps us to deliver new models faster, improve reproducibility, and make collaboration easier. All because we have exactly the same dependencies no matter where we run the container.

As Docker is widely used in the industry Data Scientists need to be able to build and run containers using Docker. Hence, in this article, I will go through the basic concept of containers. I will show you all you need to know about Docker to get started. After we have covered the theory, I will show you how you can build and run your own Docker container.

What is a container?

A container is a small, isolated environment in which everything is self-contained. The environment packages up all code and dependencies.

A container has five main features.

self-contained: A container isolates the application/software, from its environment/infrastructure. Due to this isolation, we do not need to rely on any pre-installed dependencies on the host machine. Everything we need is part of the container. This ensures that the application can always run regardless of the infrastructure.
isolated: The container has a minimal influence on the host and other containers and vice versa.
independent: We can manage containers independently. Deleting a container does not affect other containers.
portable: As a container isolates the software from the hardware, we can run it seamlessly on any machine. With this, we can move it between machines without a problem.
lightweight: Containers are lightweight as they share the host machine’s OS. As they do not require their own OS, we do not need to partition the hardware resource of the host machine.

This might sound similar to virtual machines. But there is one big difference. The difference is in how they use their host computer’s resources. Virtual machines are an abstraction of the physical hardware. They partition one server into multiple. Thus, a VM includes a full copy of the OS which takes up more space.

In contrast, containers are an abstraction at the application layer. All containers share the host’s OS but run in isolated processes. Because containers do not contain an OS, they are more efficient in using the underlying system and resources by reducing overhead.

Containers vs. Virtual Machines (Image by the author based on docker.com)

Now we know what containers are. Let’s get some high-level understanding of how Docker works. I will briefly introduce the technical terms that are used often.

What is Docker?

To understand how Docker works, let’s have a brief look at its architecture.

Docker uses a client-server architecture containing three main parts: A Docker client, a Docker daemon (server), and a Docker registry.

The Docker client is the primary way to interact with Docker through commands. We use the client to communicate through a REST API with as many Docker daemons as we want. Often used commands are docker run, docker build, docker pull, and docker push. I will explain later what they do.

The Docker daemon manages Docker objects, such as images and containers. The daemon listens for Docker API requests. Depending on the request the daemon builds, runs, and distributes Docker containers. The Docker daemon and client can run on the same or different systems.

The Docker registry is a centralized location that stores and manages Docker images. We can use them to share images and make them accessible to others.

Sounds a bit abstract? No worries, once we get started it will be more intuitive. But before that, let’s run through the needed steps to create a Docker container.

Docker Architecture (Image by author based on docker.com)

What do we need to create a Docker container?

It is simple. We only need to do three steps:

create a Dockerfile
build a Docker Image from the Dockerfile
run the Docker Image to create a Docker container

Let’s go step-by-step.

A Dockerfile is a text file that contains instructions on how to build a Docker Image. In the Dockerfile we define what the application looks like and its dependencies. We also state what process should run when launching the Docker container. The Dockerfile is composed of layers, representing a portion of the image’s file system. Each layer either adds, removes, or modifies the layer below it.

Based on the Dockerfile we create a Docker Image. The image is a read-only template with instructions to run a Docker container. Images are immutable. Once we create a Docker Image we cannot modify it anymore. If we want to make changes, we can only add changes on top of existing images or create a new image. When we rebuild an image, Docker is clever enough to rebuild only layers that have changed, reducing the build time.

A Docker Container is a runnable instance of a Docker Image. The container is defined by the image and any configuration options that we provide when creating or starting the container. When we remove a container all changes to its internal states are also removed if they are not stored in a persistent storage.

Using Docker: An example

With all the theory, let’s get our hands dirty and put everything together.

As an example, we will package a simple ML model with Flask in a Docker container. We can then run requests against the container and receive predictions in return. We will train a model locally and only load the artifacts of the trained model in the Docker Container.

I will go through the general workflow needed to create and run a Docker container with your ML model. I will guide you through the following steps:

build model
create requirements.txt file containing all dependencies
create Dockerfile
build docker image
run container

Before we get started, we need to install Docker Desktop. We will use it to view and run our Docker containers later on.

1. Build a model

First, we will train a simple RandomForestClassifier on scikit-learn’s Iris dataset and then store the trained model.

Second, we build a script making our model available through a Rest API, using Flask. The script is also simple and contains three main steps:

extract and convert the data we want to pass into the model from the payload JSON
load the model artifacts and create an onnx session and run the model
return the model’s predictions as json

I took most of the code from here and here and made only minor changes.

2. Create requirements

Once we have created the Python file we want to execute when the Docker container is running, we must create a requirements.txt file containing all dependencies. In our case, it looks like this:

3. Create Dockerfile

The last thing we need to prepare before being able to build a Docker Image and run a Docker container is to write a Dockerfile.

The Dockerfile contains all the instructions needed to build the Docker Image. The most common instructions are

FROM — this specifies the base image that the build will extend.
WORKDIR — this instruction specifies the “working directory” or the path in the image where files will be copied and commands will be executed.
COPY — this instruction tells the builder to copy files from the host and put them into the container image.
RUN — this instruction tells the builder to run the specified command.
ENV — this instruction sets an environment variable that a running container will use.
EXPOSE — this instruction sets the configuration on the image that indicates a port the image would like to expose.
USER — this instruction sets the default user for all subsequent instructions.
CMD ["", ""] — this instruction sets the default command a container using this image will run.

With these, we can create the Dockerfile for our example. We need to follow the following steps:

Determine the base image
Install application dependencies
Copy in any relevant source code and/or binaries
Configure the final image

Let’s go through them step by step. Each of these steps results in a layer in the Docker Image.

First, we specify the base image that we then build upon. As we have written in the example in Python, we will use a Python base image.

Second, we set the working directory into which we will copy all the files we need to be able to run our ML model.

Third, we refresh the package index files to ensure that we have the latest available information about packages and their versions.

Fourth, we copy in and install the application dependencies.

Fifth, we copy in the source code and all other files we need. Here, we also expose port 8080, which we will use for interacting with the ML model.

Sixth, we set a user, so that the container does not run as the root user

Seventh, we define that the example.py file will be executed when we run the Docker container. With this, we create the Flask server to run our requests against.

Besides creating the Dockerfile, we can also create a .dockerignore file to improve the build speed. Similar to a .gitignore file, we can exclude directories from the build context.

If you want to know more, please go to docker.com.

4. Create Docker Image

After we created all the files we needed to build the Docker Image.

To build the image we first need to open Docker Desktop. You can check if Docker Desktop is running by running docker ps in the command line. This command shows you all running containers.

To build a Docker Image, we need to be at the same level as our Dockerfile and requirements.txt file. We can then run docker build -t our_first_image . The -t flag indicates the name of the image, i.e., our_first_image, and the . tells us to build from this current directory.

Once we built the image we can do several things. We can

view the image by running docker image ls
view the history or how the image was created by running docker image history
push the image to a registry by running docker push

5. Run Docker Container

Once we have built the Docker Image, we can run our ML model in a container.

For this, we only need to execute docker run -p 8080:8080 in the command line. With -p 8080:8080 we connect the local port (8080) with the port in the container (8080).

If the Docker Image doesn’t expose a port, we could simply run docker run . Instead of using the image_name, we can also use the image_id.

Okay, once the container is running, let’s run a request against it. For this, we will send a payload to the endpoint by running curl X POST http://localhost:8080/invocations -H "Content-Type:application/json" -d @.path/to/sample_payload.json

Conclusion

In this article, I showed you the basics of Docker Containers, what they are, and how to build them yourself. Although I only scratched the surface it should be enough to get you started and be able to package your next model. With this knowledge, you should be able to avoid the “it works on my machine” problems.

I hope that you find this article useful and that it will help you become a better Data Scientist.

See you in my next article and/or leave a comment.

The post A Data Scientist’s Guide to Docker Containers appeared first on Towards Data Science.

How I Would Learn To Code (If I Could Start Over)

Egor Howell — Fri, 04 Apr 2025 18:43:36 +0000

According to various sources, the average salary for Coding jobs is ~£47.5k in the UK, which is ~35% higher than the median salary of about £35k.

So, coding is a very valuable skill that will earn you more money, not to mention it’s really fun.

I have been coding professionally now for 4 years, working as a data scientist and machine learning engineer and in this post, I will explain how I would learn to code if I had to do it all over again.

My journey

I still remember the time I wrote my first bit of code.
It was 9am on the first day of my physics undergrad, and we were in the computer lab.

The professor explained that computation is an integral part of modern physics as it allows us to run large-scale simulations of everything from subatomic particle collisions to the movement of galaxies.

It sounded amazing.

And the way we started this process was by going through a textbook to learn Fortran.

Yes, you heard that right.

My first programming language was Fortran, specifically Fortran 90.
I learned DO loops before FOR loops. I am definitely a rarity in this case.

In that first lab session, I remember writing “Hello World” as is the usual rite of passage and thinking, “Big woop.”

This is how you write “Hello World” in Fortran in case you are interested.

program hello
print *, 'Hello World!'
end program hello

I actually really struggled to code in Fortran and didn’t do that well on tests we had, which put me off coding.

I still have some old coding projects in Fortran on my GitHub that you can check out.

Looking back, the learning curve to coding is quite steep, but it really does compound, and eventually, it will just click.

I didn’t realise this at the time and actively avoided programming modules in my physics degree, which I regret in hindsight as my progress would have been much quicker.

During my third year, I had to do a research placement as part of my master’s. The company I chose to work for/with used a graphical programming language called LabVIEW to run and manage their experiments.

LabVIEW is based on something called “G” and taught me to think of programming differently than script-based.

However, I haven’t used it since and probably never will, but it was cool to learn then.

I did enjoy the research year somewhat, but the pace at which research moves, at least in physics, is painfully slow. Nothing like the “heyday” from the early 20th century I envisioned.

One day after work a video was recommended to me on my YouTube home page.

For those of you unaware, this was a documentary about DeepMind’s AI AlphaGo that beat the best GO player in the world. Most people thought that an AI could never be good at GO.

From the video, I started to understand how AI worked and learn about neural networks, reinforcement learning, and deep learning.
I found it all so interesting, similar to physics research in the early 20th century.

Ultimately, this is when I started studying for a career in Data Science and machine learning, where I needed to teach myself Python and SQL.

This is where I so-called “fell in love” with coding.
I saw its real potential in actually solving problems, but the main thing was that I had a motivated reason to learn. I was studying to break into a career I wanted to be in, which really drove me.

I then became a data scientist for three years and am now a Machine Learning engineer. During this time, I worked extensively with Python and SQL.

Until a few months ago, those were the only programming languages I knew. I did learn other tools, such as bash/z-shell, AWS, docker, data bricks, snowflake, etc. but not any other “proper” programming languages.

In my spare time, I dabbled a bit with C a couple of years ago, but I have forgotten virtually all of it now. I have some basic scripts on my GitHub if you are interested.

However, in my new role that I started a couple of months ago, I will be using Rust and GO, which I am very much looking forward to learning.

If you are interested in my entire journey to becoming a data scientist and machine learning engineer, you can read about it below:

How I Became A Machine Learning Engineer (No CS Degree, No Bootcamp)

Choose a language

I always recommend starting with a single language.

According to TestGorilla, there are over 8,000 programming languages, so how do you pick one?

Well, I would argue that many of these are useless for most jobs and have probably been developed as pet projects or for really niche cases.

You could choose your first language based on popularity. The Stack Overflow 2024 survey has great information on this. The most popular languages are JavaScript, Python, SQL, and Java.

However, the way I recommend you choose your first language should be based on what you want to do or work as.

Front-end web — JavaScript, HTML, CSS
Back-end web — Java, C#, Python, PHP or GO
iOS/macOS apps — Swift
Andriod apps — Kotlin or Java
Games — C++ or C
Embedded Systems — C or C++
Data science/machine learning / AI — Python and SQL

As I wanted to work in the AI/ML space, I focused my energy mainly on Python and some on SQL. It was probably a 90% / 10% split as SQL is smaller and easier to learn.

To this day, I still only know Python and SQL to a “professional” standard, but that’s fine, as pretty much the whole machine-learning community requires these languages.

This shows that you don’t need to know many languages; I have progressed quite far in my career, only knowing two to a significant depth. Of course, it would vary by sector, but the main point still stands.

So, pick a field you want to enter and choose the most in-demand and relevant language in that field.

Learn the bare minimum

The biggest mistake I see beginners make is getting stuck in “tutorial hell.”

This is where you take course after course but never branch out on your own.

I recommend taking a maximum of two courses on a language — literally any intro course would do — and then starting to build immediately.

And I literally mean, build your own projects and experience the struggle because that’s where learning is done.

You won’t know how to write functions until you do it yourself, you won’t know how to create classes until you do it yourself, and you literally won’t understand loops until you implement them yourself.

So, learn the bare minimum and immediately start experimenting; I promise it will at least 2x your learning curve.

You probably have heard this advice a lot, but in reality it is that simple.

I always say that most things in life are simple but hard to do, especially in programming.

Avoid trends

When I say avoid trends, I don’t mean not to focus on areas that are doing well or in demand in the market.

What I am saying is that when you pick a certain language or specialism, stick with it.

Programming languages all share similar concepts and patterns, so when you learn one, you indirectly improve your ability to pick up another later.

But you still should focus on one language for at least a few months.

Don’t develop “shiny object syndrome” and chase the latest technologies; it’s a game that you will unfortunately lose.

There have been so many “distracting” technologies, such as blockchain, Web3, AI, the list goes on.

Instead, focus on the fundamentals:

Data types
Design patterns
Object-oriented programming
Data structures and algorithms
Problem-solving skills

These topics transcend individual programming languages and are much better to master than the latest Javascript framework!

It’s much better to have a strong understanding of one area than try to learn everything. Not only is this more manageable, but it is also better for your long-term career.

As I said earlier, I have progressed quite well in my career by only knowing Python and SQL, as I learned the required technologies for the field and didn’t get distracted.

I can’t stress how much leverage you will have in your career if you document your learning publicly.

Document your learning

I don’t know why more people don’t do this. Sharing what I have learned online has been the biggest game changer for my career.

Literally committing your code on GitHub is enough, but I really recommend posting on LinkedIn or X, and ideally, you should create blog posts to help you cement your understanding and show off you knowledge to employers.

When I interview candidates, if they have some sort of online presence showing their learnings, that’s immediately a tick in my box and an extra edge over other applicants.

It shows enthusiasm and passion, not to mention increasing your surface area of serendipity.

I know many people are scared to do this, but you are suffering from the spotlight effect. Wikipedia defines this as:

The spotlight effect is the psychological phenomenon by which people tend to believe they are being noticed more than they really are.

No one literally cares if you post online or think about you as much as 1% as you think.

So, start posting.

What about AI?

I could spend hours discussing why AI is not an immediate risk for anyone who wants to work in the coding profession.

You should embrace AI as part of your toolkit, but that’s as far as it will go, and it will definitely not replace programmers in 5 years.

Unless an AGI breakthrough suddenly occurs in the next decade, which is highly unlikely.

I personally doubt the answer to AGI is the cross-entropy loss function, which is what is used in most LLMs nowadays.

It has been shown time and time again that these AI models lack strong mathematical reasoning abilities, which is one of the most fundamental skills to being a good coder.

Even the so-called “software engineer killer” Devin is not as good as the creators initially marketed it.

Most companies are simply trying to boost their investment by hyping AI, and their results are often over-exaggerated with controversial benchmark testing.

When I was building a website, ChatGPT even struggled with simple HTML and CSS, which you can argue is its bread and butter!

Overall, don’t worry about AI if you want to work as a coder; there is much, much bigger fish to fry before we cross that bridge!

NeetCode has done a great video explaining how current AI is incapable of replacing programmers.

Another thing!

Join my free newsletter, Dishing the Data, where I share weekly tips, insights, and advice from my experience as a practicing data scientist. Plus, as a subscriber, you’ll get my FREE Data Science Resume Template!

Connect with me

The post How I Would Learn To Code (If I Could Start Over) appeared first on Towards Data Science.

Creating an AI Agent to Write Blog Posts with CrewAI

Gustavo Santos — Fri, 04 Apr 2025 18:11:49 +0000

Introduction

I love writing. You may notice that if you follow me or my blog. For that reason, I am constantly producing new content and talking about Data Science and Artificial Intelligence.

I discovered this passion a couple of years ago when I was just starting my path in Data Science, learning and evolving my skills. At that time, I heard some more experienced professionals in the area saying that a good study technique was practicing new skills and writing about it somewhere, teaching whatever you learned.

In addition, I had just moved to the US, and nobody knew me here. So I had to start somewhere, creating my professional image in this competitive market. I remember I talked to my cousin, who’s also in the Tech industry, and he told me: write blog posts about your experiences. Tell people what you are doing. And so I did.

And I never stopped.

Fast forward to 2025, now I have almost two hundred published articles, many of them with Towards Data Science, a published Book, and a good audience.

Writing helped me so much in the Data Science area.

Most recently, one of my interests has been the amazing Natural Language Processing and Large Language Models subjects. Learning about how these modern models work is fascinating.

That interest led me to experiment with Agentic Ai as well. So, I learned about CrewAI, an easy and open-source package that helps us build AI agents in a fun and easy way, with little code. I decided to test it by creating a crew of agents to write a blog post, and then see how that goes.

In this post, we will learn how to create those agents and make them work together to produce a simple blog post.

Let’s do that.

What is a Crew?

A crew of AI agents is a combination of two or more agents, each of them performing a task towards a final goal.

In this case study, we will create a crew that will work together to produce a small blog post about a given topic that we will provide.

Crew of Agents workflow. Image by the author

The flow works like this:

We choose a given topic for the agents to write about.
Once the crew is started, it will go to the knowledge base, read some of my previously written articles, and try to mimic my writing style. Then, it generates a set of guidelines and passes it to the next agent.
Next, the Planner agent takes over and searches the Internet looking for good content about the topic. It creates a plan of content and sends it to the next agent.
The Writer agent receives the writing plan and executes it according to the context and information received.
Finally, the content is passed to the last agent, the Editor, who reviews the content and returns the final document as the output.

In the following section, we will see how this can be created.

Code

CrewAI is a great Python package because it simplifies the code for us. So, let’s begin by installing the two needed packages.

pip install crewai crewai-tools

Next, if you want, you can follow the instructions on their Quickstart page and have a full project structure created for you with just a couple of commands on a terminal. Basically, it will install some dependencies, generate the folder structure suggested for CrewAI projects, as well as generate some .yaml and .py files.

I personally prefer to create those myself, but it is up to you. The page is listed in the References section.

Folder Structure

So, here we go.

We will create these folders:

knowledge
config

And these files:

In the config folder: create the files agents.yaml and tasks.yaml
In the knowledge folder, that’s where I will add the files with my writing style.
In the project root: create crew.py and main.py.

Folders structure. Image by the author.

Make sure to create the folders with the names mentioned, as CrewAI looks for agents and tasks inside the config folder and for the knowledge base within a knowledge folder.

Next, let us set our agents.

Agents

The agents are composed of:

Name of the agent: writer_style
Role: LLMs are good role players, so here you can tell them which role to play.
Goal: tell the model what the goal of that agent is.
Backstory: Describe the story behind this agent, who it is, what it does.

writer_style:
  role: >
    Writing Style Analyst
  goal: >
    Thoroughly read the knowledge base and learn the characteristics of the crew, 
    such as tone, style, vocabulary, mood, and grammar.
  backstory: >
    You are an experienced ghost writer who can mimic any writing style.
    You know how to identify the tone and style of the original writer and mimic 
    their writing style.
    Your work is the basis for the Content Writer to write an article on this topic.

I won’t bore you with all the agents created for this crew. I believe you got the idea. It is a set of prompts explaining to each agent what they are going to do. All the agents instructions are stored in the agents.yaml file.

Think of it as if you were a manager hiring people to create a team. Think about what kinds of professionals you would need, and what skills are needed.

We need 4 professionals who will work towards the final goal of producing written content: (1) a Writer Stylist, (2) a Planner, (3) a Writer, and (4) an Editor.

If you want to see the setup for them, just check the full code in the GitHub repository.

Tasks

Now, back to the analogy of the manager hiring people, once we “hired” our entire crew, it is time to separate the tasks. We know that we want to produce a blog post, we have 4 agents, but what each of them will do.

Well, that will be configured in the file tasks.yaml.

To illustrate, let me show you the code for the Writer agent. Once again, these are the parts needed for the prompt:

Name of the task: write
Description: The description is like telling the professional how you want that task to be performed, just like we would tell a new hire how to perform their new job. Give precise instructions to get the best result possible.
Expected output: This is how we want to see the output. Notice that I give instructions like the size of the blog post, the quantity of paragraphs, and other information that helps my agent to give me the expected output.
Agent to perform it: Here, we are indicating the agent who will perform this task, using the same name set in the agents.yaml file.
Output file: Now always applicable, but if so, this is the argument to use. We asked for a markdown file as output.

write:
  description: >
    1. Use the content plan to craft a compelling blog post on {topic}.
    2. Incorporate SEO keywords naturally.
    3. Sections/Subtitles are properly named in an engaging manner. Make sure 
    to add Introduction, Problem Statement, Code, Before You Go, References.
    4. Add a summarizing conclusion - This is the "Before You Go" section.
    5. Proofread for grammatical errors and alignment with the writer's style.
    6. Use analogies to make the article more engaging and complex concepts easier
    to understand.
  expected_output: >
    A well-written blog post in markdown format, ready for publication.
    The article must be within a 7 to 12 minutes read.
    Each section must have at least 3 paragraphs.
    When writing code, you will write a snippet of code and explain what it does. 
    Be careful to not add a huge snippet at a time. Break it in reasonable chunks.
    In the examples, create a sample dataset for the code.
    In the Before You Go section, you will write a conclusion that is engaging
    and factually accurate.
  agent: content_writer
  output_file: blog_post.md

After the agents and tasks are defined, it is time to create our crew flow.

Coding the Crew

Now we will create the file crew.py, where we will translate the previously presented flow to Python code.

We begin by importing the needed modules.

#Imports
import os
from crewai import Agent, Task, Process, Crew, LLM
from crewai.project import CrewBase, agent, crew, task
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai_tools import SerperDevTool

We will use the basic Agent, Task, Crew, Process and LLM to create our flow. PDFKnowledgeSource will help the first agent learning my writing style, and SerperDevTool is the tool to search the internet. For that one, make sure to get your API key at https://serper.dev/signup.

A best practice in software development is to keep your API keys and configuration settings separate from your code. We’ll use a .env file for this, providing a secure place to store these values. Here’s the command to load them into our environment.

from dotenv import load_dotenv
load_dotenv()

Then, we will use the PDFKnowledgeSource to show the Crew where to search for the writer’s style. By default, that tool looks at the knowledge folder of your project, thus the importance of the name being the same.

# Knowledge sources

pdfs = PDFKnowledgeSource(
    file_paths=['article1.pdf',
                'article2.pdf',
                'article3.pdf'
                ]
)

Now we can set up the LLM we want to use for the Crew. It can be any of them. I tested a bunch of them, and those I liked the most were qwen-qwq-32b and gpt-4o. If you choose OpenAI’s, you will need an API Key as well. For Qwen-QWQ, just uncomment the code and comment out the OpenAI’s lines.. You need an API key from Groq.

# LLMs

llm = LLM(
    # model="groq/qwen-qwq-32b",
    # api_key= os.environ.get("GROQ_API_KEY"),
    model= "gpt-4o",
    api_key= os.environ.get("OPENAI_API_KEY"),
    temperature=0.4
)

Now we have to create a Crew Base, showing where CrewAI can find the agents and tasks configuration files.

# Creating the crew: base shows where the agents and tasks are defined

@CrewBase
class BlogWriter():
    """Crew to write a blog post"""
    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

Agents Functions

And we are ready to create the code for each agent. They are composed of a decorator @agent to show that the following function is an agent. We then use the class Agent and indicate the name of the agent in the config file, the level of verbosity, being 1 low, 2 high. You can also use a Boolean value, such as true or false.

Lastly, we specify if the agent uses any tool, and what model it will use.

# Configuring the agents
    @agent
    def writer_style(self) -> Agent:
        return Agent(
                config=self.agents_config['writer_style'],
                verbose=1,
                knowledge_sources=[pdfs]
                )

    @agent
    def planner(self) -> Agent:
        return Agent(
        config=self.agents_config['planner'],
        verbose=True,
        tools=[SerperDevTool()],
        llm=llm
        )

    @agent
    def content_writer(self) -> Agent:
        return Agent(
        config=self.agents_config['content_writer'],
        verbose=1
        )

    @agent
    def editor(self) -> Agent:
        return Agent(
        config=self.agents_config['editor'],
        verbose=1
        )

Tasks Functions

The next step is creating the tasks. Similarly to the agents, we will create a function and decorate it with @task. We use the class Task to inherit CrewAI’s functionalities and then point to the task to be used from our tasks.yaml file to be used for each task created. If any output file is expected, use the output_file argument.

# Configuring the tasks    

    @task
    def style(self) -> Task:
        return Task(
        config=self.tasks_config['mystyle'],
        )

    @task
    def plan(self) -> Task:
        return Task(
        config=self.tasks_config['plan'],
        )

    @task
    def write(self) -> Task:
        return Task(
        config=self.tasks_config['write'],
        output_file='output/blog_post.md' # This is the file that will be contain the final blog post.
        )

    @task
    def edit(self) -> Task:
        return Task(
        config=self.tasks_config['edit']
        )

Crew

To glue everything together, we now create a function and decorate it with the @crew decorator. That function will line up the agents and the tasks in the order to be performed, since the process chosen here is the simplest: sequential. In other words, everything runs in sequence, from start to finish.

@crew

    def crew(self) -> Crew:
        """Creates the Blog Post crew"""

        return Crew(
            agents= [self.writer_style(), self.planner(), self.content_writer(), self.editor(), self.illustrator()],
            tasks= [self.style(), self.plan(), self.write(), self.edit(), self.illustrate()],
            process=Process.sequential,
            verbose=True
        )

Running the Crew

Running the crew is very simple. We create the main.py file and import the Crew Base BlogWriter created. Then we just use the functions crew().kickoff(inputs) to run it, passing a dictionary with the inputs to be used to generate the blog post.

# Script to run the blog writer project

# Warning control
import warnings
warnings.filterwarnings('ignore')
from crew import BlogWriter


def write_blog_post(topic: str):
    # Instantiate the crew
    my_writer = BlogWriter()
    # Run
    result = (my_writer
              .crew()
              .kickoff(inputs = {
                  'topic': topic
                  })
    )

    return result

if __name__ == "__main__":

    write_blog_post("Price Optimization with Python")

There it is. The result is a nice blog post created by the LLM. See below.

Resulting blog post. GIF by the author.

That is so nice!

Before You Go

Before you go, know that this blog post was 100% created by me. This crew I created was an experiment I wanted to do to learn more about how to create AI agents and make them work together. And, like I said, I love writing, so this is something I would be able to read and assess the quality.

My opinion is that this crew still didn’t do a very good job. They were able to complete the tasks successfully, but they gave me a very shallow post and code. I would not publish this, but at least it could be a start, maybe.

From here, I encourage you to learn more about CrewAI. I took their free course where João de Moura, the creator of the package, shows us how to create different kinds of crews. It is really interesting.

GitHub Repository

https://github.com/gurezende/Crew_Writer

About Me

If you want to learn more about my work, or follow my blog (it is really me!), here are my contacts and portfolio.

https://gustavorsantos.me

References

[Quickstart CrewAI](https://docs.crewai.com/quickstart)

[CrewAI Documentation](https://docs.crewai.com/introduction)

[GROQ](https://groq.com/)

[OpenAI](https://openai.com)

[CrewAI Free Course](https://learn.crewai.com/)

The post Creating an AI Agent to Write Blog Posts with CrewAI appeared first on Towards Data Science.

PyScript vs. JavaScript: A Battle of Web Titans

Pol Marin — Wed, 02 Apr 2025 17:15:17 +0000

We’re delving into frontend web development today, and you might be thinking: what does this have to do with Data Science? Why is Towards Data Science publishing a post related to web dev?

Well, because data science isn’t only about building powerful models, engaging in advanced analytics, or cleaning and transforming data—presenting the results is also a key part of our job. And there are several ways to do it: PowerPoint presentations, interactive dashboards (like Tableau), or, as you’ve guessed, through a website.

Speaking from personal experience, I work daily on developing the website we use to present our data-driven results. Using a website instead of PowerPoints or Tableau has many advantages, with freedom and customization being the biggest ones.

Even though I’ve come to (kind of) enjoy JavaScript, it will never match the fun of coding in Python. Luckily, at FOSDEM, I learned about PyScript, and to my surprise, it’s not as alpha as I initially thought.

But is that enough to call it a potential JavaScript replacement? That’s exactly what we’re going to explore today.

JavaScript has been the king of web development for decades. It’s everywhere: from simple button clicks to complex web apps like Gmail and Netflix. But now, there’s a challenger stepping into the ring—PyScript—a framework that lets you run Python in the browser without needing a backend.Sounds like a dream, right? Let’s break it down in an entertaining head-to-head battle between these two web technologies to see if PyScript is a true competitor!

Round 1: What Are They?

This is like the Jake Paul vs Mike Tyson battle: the new challenger (PyScript) vs the veteran champion (JS). Don’t worry, I’m not saying today’s battle will be a disappointment as well.

Let’s start with the veteran: JavaScript.

Created in 1995, JavaScript is the backbone of web development.
Runs natively in browsers, controlling everything from user interactions to animations.
Supported by React, Vue, Angular, and a massive ecosystem of frameworks.
Can directly manipulate the DOM, making web pages dynamic.

Now onto the novice: PyScript.

Built on Pyodide (a Python-to-WebAssembly project), PyScript lets you write Python inside an HTML file.
No need for backend servers—your Python code runs directly in the browser.
Can import Python libraries like NumPy, Pandas, and Matplotlib.
But… it’s still evolving and has limitations.

This last but is a big one, so JavaScript wins the first round!

Round 2: Performance Battle

When it comes to speed, JavaScript is like Usain Bolt—optimized and blazing fast. It runs natively in the browser and is fine-tuned for performance. On the other hand, PyScript runs Python via WebAssembly, which means extra overhead.

Let’s use a real mini-project: a simple counter app. We’ll build it using both alternatives and see which one performs better.

JavaScript

PyScript


from pyscript import display
count = 0

def increment():
    global count
    count += 1
    display(count, target="count")


0

Putting them to the test:

JavaScript runs instantly.
PyScript has a noticeable delay.

End of round: JS increases its advantage making it 2-0!

Round 3: Ease of Use & Readability

Neither of both languages is perfect (for example, neither includes static typing), but their syntax is very different. JavaScript can be quite messy:

const numbers = [1, 2, 3];
const doubled = numbers.map(num => num * 2);

While Python is far easier to understand:

numbers = [1, 2, 3]
doubled = [num * 2 for num in numbers]

The fact that PyScript lets us use the Python syntax makes it the round winner without a doubt. Even though I’m clearly biased towards Python, the fact that it’s beginner-friendly and usually more concise and simple than JS makes it better in terms of usability.

The problem for PyScript is that JavaScript is already deeply integrated into browsers, making it more practical. Despite this, PyScript wins the round making it 2-1.

One more round to go…

Round 4: Ecosystem & Libraries

JavaScript has countless frameworks like React, Vue, and Angular, making it a powerhouse for building dynamic web applications. Its libraries are specifically optimized for the web, providing tools for everything from UI components to complex animations.

On the other hand, PyScript benefits from Python’s vast ecosystem of scientific computing and data science libraries, such as NumPy, Pandas, and Matplotlib. While these tools are excellent for Data Visualization and analysis, they aren’t optimized for frontend web development. Additionally, PyScript requires workarounds to interact with the DOM, which JavaScript handles natively and efficiently.

While PyScript is an exciting tool for embedding Python into web applications, it’s still in its early stages. JavaScript remains the more practical choice for general web development, whereas PyScript shines in scenarios where Python’s computational power is needed within the browser.

Here’s a table summarizing some of the key components

Feature	JavaScript	PyScript
DOM Control	Direct & instant	Requires JavaScript workarounds
Performance	Optimized for browsers	WebAssembly overhead
Ecosystem	Huge (React, Vue, Angular)	Limited, still growing
Libraries	Web-focused (Lodash, D3.js)	Python-focused (NumPy, Pandas)
Use Cases	Full web apps	Data-heavy apps, interactive widgets

Round’s verdict: JavaScript dominates in general web dev, but Pyscript shines for Python-centric projects.

Final Verdict

This was a quick fight! We still don’t know who won though…

Time to reveal it:

If you’re building a full web app, JavaScript is the clear winner.
If you’re adding Python-powered interactivity (e.g., data visualization), PyScript could be useful.

With that said, it’s fair to say that JavaScript (and its derivatives) still remains the web’s frontend best option. However, the future of PyScript is one to watch: If performance improves and it gets better browser integration, PyScript could become a strong hybrid tool for Python developers willing to incorporate more data-related tasks on the frontend.

Winner: JavaScript.

The post PyScript vs. JavaScript: A Battle of Web Titans appeared first on Towards Data Science.

The Art of Hybrid Architectures

Eric Chung — Sat, 29 Mar 2025 03:38:17 +0000

In my previous article, I discussed how morphological feature extractors mimic the way biological experts visually assess images.

This time, I want to go a step further and explore a new question:
Can different architectures complement each other to build an AI that “sees” like an expert?

Introduction: Rethinking Model Architecture Design

While building a high accuracy visual recognition model, I ran into a key challenge:

How do we get AI to not just “see” an image, but actually understand the features that matter?

Traditional CNNs excel at capturing local details like fur texture or ear shape, but they often miss the bigger picture. Transformers, on the other hand, are great at modeling global relationships, how different regions of an image interact, but they can easily overlook fine-grained cues.

This insight led me to explore combining the strengths of both architectures to create a model that not only captures fine details but also comprehends the bigger picture.

While developing PawMatchAI, a 124-breed dog classification system, I went through three major architectural phases:

1. Early Stage: EfficientNetV2-M + Multi-Head Attention

I started with EfficientNetV2-M and added a multi-head attention module.

I experimented with 4, 8, and 16 heads—eventually settling on 8, which gave the best results.

This setup reached an F1 score of 78%, but it felt more like a technical combination than a cohesive design.

2. Refinement: Focal Loss + Advanced Data Augmentation

After closely analyzing the dataset, I noticed a class imbalance, some breeds appeared far more frequently than others, skewing the model’s predictions.

To address this, I introduced Focal Loss, along with RandAug and mixup, to make the data distribution more balanced and diverse.
This pushed the F1 score up to 82.3%.

3. Breakthrough: Switching to ConvNextV2-Base + Training Optimization

Next, I replaced the backbone with ConvNextV2-Base, and optimized the training using OneCycleLR and a progressive unfreezing strategy.
The F1 score climbed to 87.89%.

But during real-world testing, the model still struggled with visually similar breeds, indicating room for improvement in generalization.

4. Final Step: Building a Truly Hybrid Architecture

After reviewing the first three phases, I realized the core issue: stacking technologies isn’t the same as getting them to work together.

What I needed was true collaboration between the CNN, the Transformer, and the morphological feature extractor, each playing to its strengths. So I restructured the entire pipeline.

ConvNextV2 was in charge of extracting detailed local features.
The morphological module acted like a domain expert, highlighting features critical for breed identification.

Finally, the multi-head attention brought it all together by modeling global relationships.

This time, they weren’t just independent modules, they were a team.
CNNs identified the details, the morphology module amplified the meaningful ones, and the attention mechanism tied everything into a coherent global view.

Key Result: The F1 score rose to 88.70%, but more importantly, this gain came from the model learning to understand morphology, not just memorize textures or colors.

It started recognizing subtle structural features—just like a real expert would—making better generalizations across visually similar breeds.

If you’re interested, I’ve written more about morphological feature extractors here.

These extractors mimic how biological experts assess shape and structure, enhancing critical visual cues like ear shape and body proportions.

They’re a vital part of this hybrid design, filling the gaps traditional models tend to overlook.

In this article, I’ll walk through:

The strengths and limitations of CNNs vs. Transformers—and how they can complement each other
Why I ultimately chose ConvNextV2 over EfficientNetV2
The technical details of multi-head attention and how I decided the number of heads
How all these elements came together in a unified hybrid architecture
And finally, how heatmaps reveal that the AI is learning to “see” key features, just like a human expert

1. The Strengths and Limitations of CNNs and Transformers

In the previous section, I discussed how CNNs and Transformers can effectively complement each other. Now, let’s take a closer look at what sets each architecture apart, their individual strengths, limitations, and how their differences make them work so well together.

1.1 The Strength of CNNs: Great with Details, Limited in Scope

CNNs are like meticulous artists, they can draw fine lines beautifully, but often miss the bigger composition.

Strong at Local Feature Extraction
CNNs are excellent at capturing edges, textures, and shapes—ideal for distinguishing fine-grained features like ear shapes, nose proportions, and fur patterns across dog breeds.

Computational Efficiency
With parameter sharing, CNNs process high-resolution images more efficiently, making them well-suited for large-scale visual tasks.

Translation Invariance
Even when a dog’s pose varies, CNNs can still reliably identify its breed.

That said, CNNs have two key limitations:

Limited Receptive Field:
CNNs expand their field of view layer by layer, but early-stage neurons only “see” small patches of pixels. As a result, it’s difficult for them to connect features that are spatially far apart.

For instance: When identifying a German Shepherd, the CNN might spot upright ears and a sloped back separately, but struggle to associate them as defining characteristics of the breed.

Lack of Global Feature Integration:
CNNs excel at local stacking of features, but they’re less adept at combining information from distant regions.

Example: To distinguish a Siberian Husky from an Alaskan Malamute, it’s not just about one feature, it’s about the combination of ear shape, facial proportions, tail posture, and body size. CNNs often struggle to consider these elements holistically.

1.2 The Strength of Transformers: Global Awareness, But Less Precise

Transformers are like master strategists with a bird’s-eye view, they quickly spot patterns, but aren’t great at filling in the fine details.

Capturing Global Context
Thanks to their self-attention mechanism, Transformers can directly link any two features in an image, no matter how far apart they are.

Dynamic Attention Weighting
Unlike CNNs’ fixed kernels, Transformers dynamically allocate focus based on context.

Example: When identifying a Poodle, the model may prioritize fur texture; when it sees a Bulldog, it might focus more on facial structure.

But Transformers also have two major drawbacks:

High Computational Cost:
Self-attention has a time complexity of O(n²). As image resolution increases, so does the cost—making training more intensive.

Weak at Capturing Fine Details:
Transformers lack CNNs’ “built-in intuition” that nearby pixels are usually related.

Example: On their own, Transformers might miss subtle differences in fur texture or eye shape, details that are crucial for distinguishing visually similar breeds.

1.3 Why a Hybrid Architecture Is Necessary

Let’s take a real world case:

How do you distinguish a Golden Retriever from a Labrador Retriever?

They’re both beloved family dogs with similar size and temperament. But experts can easily tell them apart by observing:

Golden Retrievers have long, dense coats ranging from golden to dark gold, more elongated heads, and distinct feathering around ears, legs, and tails.
Labradors, on the other hand, have short, double-layered coats, more compact bodies, rounder heads, and thick otter-like tails. Their coats come in yellow, chocolate, or black.

Interestingly, for humans, this distinction is relatively easy, “long hair vs. short hair” might be all you need.

But for AI, relying solely on coat length (a texture-based feature) is often unreliable. Lighting, image quality, or even a trimmed Golden Retriever can confuse the model.

When analyzing this challenge, we can see…

The problem with using only CNNs:

While CNNs can detect individual features like “coat length” or “tail shape,” they struggle with combinations like “head shape + fur type + body structure.” This issue worsens when the dog is in a different pose.

The problem with using only Transformers:

Transformers can associate features across the image, but they’re not great at picking up fine-grained cues like slight variations in fur texture or subtle head contours. They also require large datasets to achieve expert-level performance.
Plus, their computational cost increases sharply with image resolution, slowing down training.

These limitations highlight a core truth:

Fine-grained visual recognition requires both local detail extraction and global relationship modeling.

A truly expert system like a veterinarian or show judge must inspect features up close while understanding the overall structure. That’s exactly where hybrid architectures shine.

1.4 The Advantages of a Hybrid Architecture

This is why we need hybrid systems architectures that combine CNNs’ precision in local features with Transformers’ ability to model global relationships:

CNNs: Extract local, fine-grained features like fur texture and ear shape, crucial for spotting subtle differences.
Transformers: Capture long-range dependencies (e.g., head shape + body size + eye color), allowing the model to reason holistically.
Morphological Feature Extractors: Mimic human expert judgment by emphasizing diagnostic features, bridging the gap left by data-driven models.

Such an architecture not only boosts evaluation metrics like the F1 Score, but more importantly, it enables the AI to genuinely understand the subtle distinctions between breeds, getting closer to the way human experts think. The model learns to weigh multiple features together, instead of over-relying on one or two unstable cues.

In the next section, I’ll dive into how I actually built this hybrid architecture, especially how I selected and integrated the right components.

2. Why I Chose ConvNextV2: Key Innovations Behind the Backbone

Among the many visual recognition architectures available, why did I choose ConvNextV2 as the backbone of my project?

Because its design effectively combines the best of both worlds: the CNN’s ability to extract precise local features, and the Transformer’s strength in capturing long-range dependencies.

Let’s break down three core innovations that made it the right fit.

2.1 FCMAE Self-Supervised Learning: Adaptive Learning Inspired by the Human Brain

Imagine learning to navigate with your eyes covered, your brain becomes laser-focused on memorizing the details you can perceive.

ConvNextV2 uses a self-supervised pretraining strategy similar to that of Vision Transformers.

During training, up to 60% of input pixels are intentionally masked, and the model must learn to reconstruct the missing regions.
This “make learning harder on purpose” approach actually leads to three major benefits:

Comprehensive Feature Learning
The model learns the underlying structure and patterns of an image—not just the most obvious visual cues.
In the context of breed classification, this means it pays attention to fur texture, skeletal structure, and body proportions, instead of relying solely on color or shape.
Reduced Dependence on Labeled Data
By pretraining on unlabeled dog images, the model develops strong visual representations.
Later, with just a small amount of labeled data, it can fine-tune effectively—saving significant annotation effort.
Improved Recognition of Rare Patterns
The reconstruction task pushes the model to learn generalized visual rules, enhancing its ability to identify rare or underrepresented breeds.

2.2 GRN Global Calibration: Mimicking an Expert’s Attention

Like a seasoned photographer who adjusts the exposure of each element to highlight what truly matters.

GRN (Global Response Normalization) is arguably the most impactful innovation in ConvNextV2, giving CNNs a degree of global awareness that was previously lacking:

Dynamic Feature Recalibration
GRN globally normalizes the feature map, amplifying the most discriminative signals while suppressing irrelevant ones.
For instance, when identifying a German Shepherd, it emphasizes upright ears and the sloped back while minimizing background noise.
Enhanced Sensitivity to Subtle Differences
This normalization sharpens feature contrast, making it easier to spot fine-grained differences—critical for telling apart breeds like the Siberian Husky and Alaskan Malamute.
Focus on Diagnostic Features
GRN helps the model prioritize features that truly matter for classification, rather than relying on statistically correlated but causally irrelevant cues.

2.3 Sparse and Efficient Convolutions: More with Less

Like a streamlined team where each member plays to their strengths, reducing redundancy while boosting performance.

ConvNextV2 incorporates architectural optimizations such as depthwise separable convolutions and sparse connections, resulting in three major gains:

Improved Computational Efficiency
By breaking down convolutions into smaller, more efficient steps, the model reduces its computational load.
This allows it to process high-resolution dog images and detect fine visual differences without requiring excessive resources.
Expanded Effective Receptive Field
The layout of convolutions is designed to extend the model’s field of view, helping it analyze both overall body structure and local details simultaneously.
Parameter Efficiency
The architecture ensures that each parameter carries more learning capacity, extracting richer, more nuanced information using the same amount of compute.

2.4 Why ConvNextV2 Was the Right Fit for a Hybrid Architecture

ConvNextV2 turned out to be the perfect backbone for this hybrid system, not just because of its performance, but because it embodies the very philosophy of fusion.

It retains the local precision of CNNs while adopting key design concepts from Transformers to expand its global awareness. This duality makes it a natural bridge between CNNs and Transformers apable of preserving fine-grained details while understanding the broader context.

It also lays the groundwork for additional modules like multi-head attention and morphological feature extractors, ensuring the model starts with a complete, balanced feature set.

In short, ConvNextV2 doesn’t just “see the parts”, it starts to understand how the parts come together. And in a task like dog breed classification, where both minute differences and overall structure matter, this kind of foundation is what transforms an ordinary model into one that can reason like an expert.

3. Technical Implementation of the MultiHeadAttention Mechanism

In neural networks, the core concept of the attention mechanism is to enable models to “focus” on key parts of the input, similar to how human experts consciously focus on specific features (such as ear shape, muzzle length, tail posture) when identifying dog breeds.
The Multi-Head Attention (MHA) mechanism further enhances this ability:

“Rather than having one expert evaluate all features, it’s better to form a panel of experts, letting each focus on different details, and then synthesize a final judgment!”

Mathematically, MHA uses multiple linear projections to allow the model to simultaneously learn different feature associations, further enhancing performance.

3.1 Understanding MultiHeadAttention from a Mathematical Perspective

The core idea of MultiHeadAttention is to use multiple different projections to allow the model to simultaneously attend to patterns in different subspaces. Mathematically, it first projects input features into three roles: Query, Key, and Value, then calculates the similarity between Query (Q) and Key (K), and uses this similarity to perform weighted averaging of Values.

The basic formula can be expressed as:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

3.2 Application of Einstein Summation Convention in Attention Calculation

In the implementation, I used the torch.einsum function based on the Einstein summation convention to efficiently calculate attention scores:

energy = torch.einsum("nqd,nkd->nqk", [q, k])

This means:
q has shape (batch_size, num_heads, query_dim)
k has shape (batch_size, num_heads, key_dim)
The dot product is performed on dimension d, resulting in (batch_size, num_heads, query_len, key_len) This is essentially “calculating similarity between each Query and all Keys,” generating an attention weight matrix

3.3 Implementation Code Analysis

Key implementation code for MultiHeadAttention:

def forward(self, x):

    N = x.shape[0]  # batch size

    # 1. Project input, prepare for multi-head attention calculation
    x = self.fc_in(x)  # (N, input_dim) → (N, scaled_dim)

    # 2. Calculate Query, Key, Value, and reshape into multi-head form
    q = self.query(x).view(N, self.num_heads, self.head_dim)  # query
    k = self.key(x).view(N, self.num_heads, self.head_dim)    # key
    v = self.value(x).view(N, self.num_heads, self.head_dim)  # value

    # 3. Calculate attention scores (similarity matrix)
    energy = torch.einsum("nqd,nkd->nqk", [q, k])

    # 4. Apply softmax (normalize weights) and perform scaling
    attention = F.softmax(energy / (self.head_dim ** 0.5), dim=2)

    # 5. Use attention weights to perform weighted sum on Value
    out = torch.einsum("nqk,nvd->nqd", [attention, v])

    # 6. Rearrange output and pass through final linear layer
    out = out.reshape(N, self.scaled_dim)
    out = self.fc_out(out)

    return out

3.3.1. Steps 1-2: Projection and Multi-Head Splitting
First, input features are projected through a linear layer, and then separately projected into query, key, and value spaces. Importantly, these projections not only change the feature representation but also split them into multiple “heads,” each attending to different feature subspaces.

3.3.2. Steps 3-4: Attention Calculation

3.3.3. Steps 5-6: Weighted Aggregation and Output Projection
Using the calculated attention weights, weighted summation is performed on the value vectors to obtain the attended feature representation. Finally, outputs from all heads are concatenated and passed through an output projection layer to get the final result.

This implementation has the following simplifications and adjustments compared to standard Transformer MultiHeadAttention: Query, key, and value come from the same input (self-attention), suitable for processing features obtained from CNN backbone networks.

It uses einsum operations to simplify matrix calculations.

The design of projection layers ensures dimensional consistency, facilitating integration with other modules.

3.4 How Attention Mechanisms Enhance Understanding of Morphological Feature Relationships

The multi-head attention mechanism brings three core advantages to dog breed recognition:

3.4.1. Feature Relationship Modeling

Just as a professional veterinarian not only sees that ears are upright but also notices how this combines with tail curl degree and skull shape to form a dog breed’s “feature combination.”

It can establish associations between different morphological features, capturing their synergistic relationships, not just seeing “what features exist” but observing “how these features combine.”

Application: The model can learn that a combination of “pointed ears + curled tail + medium build” points to specific Northern dog breeds.

3.4.2. Dynamic Feature Importance Assessment

Just as experts know to focus particularly on fur texture when identifying Poodles, while focusing mainly on the distinctive nose and head structure when identifying Bulldogs.

It dynamically adjusts focus on different features based on the specific content of the input.

Key features vary across different breeds, and the attention mechanism can adaptively focus.

Application: When seeing a Border Collie, the model might focus more on fur color distribution; when seeing a Dachshund, it might focus more on body proportions

3.4.3. Complementary Information Integration

Like a team of experts with different specializations, one focusing on skeletal structure, another on fur features, another analyzing behavioral posture, making a more comprehensive judgment together.

Through multiple attention heads, each simultaneously captures different types of feature relationships. Each head can focus on a specific type of feature or relationship pattern.

Application: One head might primarily focus on color patterns, another on body proportions, and yet another on facial features, ultimately synthesizing these perspectives to make a judgment.

By combining these three capabilities, the MultiHeadAttention mechanism goes beyond identifying individual features, it learns to model the complex relationships between them, capturing subtle patterns that emerge from their combinations and enabling more accurate recognition.

4. Implementation Details of the Hybrid Architecture

4.1 The Overall Architectural Flow

When designing this hybrid architecture, my goal was simple yet ambitious:

Let each component do what it does best, and build a complementary system where they enhance one another.

Much like a well-orchestrated symphony, each instrument (or module) plays its role, only together can they create harmony.
In this setup:

The CNN focuses on capturing local details.
The morphological feature extractor enhances key structural features.
The multi-head attention module learns how these features interact.

As shown in the diagram above, the overall model operates through five key stages:

4.1.1. Feature Extraction

Once an image enters the model, ConvNextV2 takes charge of extracting foundational features, such as fur color, contours, and texture. This is where the AI begins to “see” the basic shape and appearance of the dog.

4.1.2. Morphological Feature Enhancement

These initial features are then refined by the morphological feature extractor. This module functions like an expert’s eye—highlighting structural characteristics such as ear shape and body proportions. Here, the AI learns to focus on what actually matters.

4.1.3. Feature Fusion

Next comes the feature fusion layer, which merges the local features with the enhanced morphological cues. But this isn’t just a simple concatenation, the layer also models how these features interact, ensuring the AI doesn’t treat them in isolation, but rather understands how they combine to convey meaning.

4.1.4. Feature Relationship Modeling

The fused features are passed into the multi-head attention module, which builds contextual relationships between different attributes. The model begins to understand combinations like “ear shape + fur texture + facial proportions” rather than looking at each trait independently.

4.1.5. Final Classification

After all these layers of processing, the model moves to its final classifier, where it makes a prediction about the dog’s breed, based on the rich, integrated understanding it has developed.

4.2 Integrating ConvNextV2 and Parameter Setup

For implementation, I chose the pretrained ConvNextV2-base model as the backbone:

self.backbone = timm.create_model(
    'convnextv2_base',
    pretrained=True,
    num_classes=0)  # Use only the feature extractor; remove original classification head

Depending on the input image size or backbone architecture, the feature output dimensions may vary. To build a robust and flexible system, I designed a dynamic feature dimension detection mechanism:

with torch.no_grad():
    dummy_input = torch.randn(1, 3, 224, 224)
    features = self.backbone(dummy_input)
    if len(features.shape) > 2:
        features = features.mean([-2, -1])  # Global average pooling to produce a 1D feature vector
    self.feature_dim = features.shape[1]

This ensures the system automatically adapts to any feature shape changes, keeping all downstream components functioning properly.

4.3 Intelligent Configuration of the Multi-Head Attention Layer

As mentioned earlier, I experimented with several head counts. Too many heads increased computation and risked overfitting. I ultimately settled on eight, but allowed the number of heads to adjust automatically based on feature dimensions:

self.num_heads = max(1, min(8, self.feature_dim // 64))
self.attention = MultiHeadAttention(self.feature_dim, num_heads=self.num_heads)

4.4 Making CNN, Transformers, and Morphological Features Work Together

The morphological feature extractor works hand-in-hand with the attention mechanism.

While the former provides structured representations of key traits, the latter models relationships among these features:

# Feature fusion
combined_features = torch.cat([
    features,  # Base features
    morphological_features,  # Morphological features
    features * morphological_features  # Interaction between features
], dim=1)
fused_features = self.feature_fusion(combined_features)

# Apply attention
attended_features = self.attention(fused_features)

# Final classification
logits = self.classifier(attended_features)

return logits, attended_features

A special note about the third component features * morphological_features — this isn’t just a mathematical multiplication. It creates a form of dialogue between the two feature sets, allowing them to influence each other and generate richer representations.

For example, suppose the model picks up “pointy ears” from the base features, while the morphological module detects a “small head-to-body ratio.”

Individually, these may not be conclusive, but their interaction may strongly suggest a specific breed, like a Corgi or Finnish Spitz. It’s no longer just about recognizing ears or head size, the model learns to interpret how features work together, much like an expert would.
This full pipeline from feature extraction, through morphological enhancement and attention-driven modeling, to prediction is my vision of what an ideal architecture should look like.

The design has several key advantages:

The morphological extractor brings structured, expert-inspired understanding.
The multi-head attention uncovers contextual relationships between traits.
The feature fusion layer captures nonlinear interactions through element-wise multiplication.

4.5 Technical Challenges and How I Solved Them

Building a hybrid architecture like this was far from smooth sailing.
Here are several challenges I faced and how solving them helped me improve the overall design:

4.5.1. Mismatched Feature Dimensions

Challenge: Output sizes varied across modules, especially when switching backbone networks.
Solution: In addition to the dynamic dimension detection mentioned earlier, I implemented adaptive projection layers to unify the feature dimensions.

4.5.2. Balancing Performance and Efficiency

Challenge: More complexity meant more computation.
Solution: I dynamically adjusted the number of attention heads, and used efficient einsum operations to optimize performance.

4.5.3. Overfitting Risk

Challenge: Hybrid models are more prone to overfitting, especially with smaller training sets.
Solution: I applied LayerNorm, Dropout, and weight decay for regularization.

4.5.4. Gradient Flow Issues

Challenge: Deep architectures often suffer from vanishing or exploding gradients.
Solution: I introduced residual connections to ensure gradients flow smoothly during both forward and backward passes.

If you’re interested in exploring the full implementation, feel free to check out the GitHub project here.

5. Performance Evaluation and Heatmap Analysis

The value of a hybrid architecture lies not only in its quantitative performance but also in how it qualitatively “thinks.”

In this section, we’ll use confidence score statistics and heatmap analysis to demonstrate how the model evolved from CNN → CNN+Transformer → CNN+Transformer+MFE, and how each stage brought its visual reasoning closer to that of a human expert.

To ensure that the performance differences came purely from architecture design, I retrained each model using the exact same dataset, augmentation methods, loss function, and training parameters. The only variation was the presence or absence of the Transformer and morphological modules.

In terms of F1 score, the CNN-only model reached 87.83%, the CNN+Transformer variant performed slightly better at 89.48%, and the final hybrid model scored 88.70%. While the transformer-only version showed the highest score on paper, it didn’t always translate into more reliable predictions. In fact, the hybrid model was more consistent in practice and handled similar-looking or blurry cases more reliably.

5.1 Confidence Scores and Statistical Insights

I tested 17 images of Border Collies, including standard photos, artistic illustrations, and various camera angles, to thoroughly assess the three architectures.

While other breeds were also included in the broader evaluation, I chose Border Collie as a representative case due to its distinctive features and frequent confusion with similar breeds.

Figure 1: Model Confidence Score Comparison
As shown above, there are clear performance differences across the three models.

A notable example is Sample #3, where the CNN-only model misclassified the Border Collie as a Collie, with a low confidence score of 0.2492.

While the CNN+Transformer corrected this error, it introduced a new one in Sample #5, misidentifying it as a Shiba Inu with 0.2305 confidence.

The final CNN+Transformer+MFE model correctly identified all samples without error. What’s interesting here is that both misclassifications occurred at low confidence levels (below 0.25).
This suggests that even when the model makes a mistake, it retains a sense of uncertainty—a desirable trait in real world applications. We want models to be cautious when unsure, rather than confidently wrong.

Figure 2: Confidence Score Distribution
Looking at the distribution of confidence scores, the improvement becomes even more evident.

The CNN-only model mostly predicted in the 0.4–0.5 range, with few samples reaching beyond 0.6.

CNN+Transformer showed better concentration around 0.5–0.6, but still had only one sample in the 0.7–0.8 high-confidence range.
The CNN+Transformer+MFE model stood out with 6 samples reaching the 0.7–0.8 confidence level.

This rightward shift in distribution reveals more than just accuracy, it reflects certainty.

The model is evolving from “barely correct” to “confidently correct,” which significantly enhances its reliability in real-world deployment.

Figure 3: Statistical Summary of Model Performance
A deeper statistical breakdown highlights consistent improvements:

Mean confidence score rose from 0.4639 (CNN) to 0.5245 (CNN+Transformer), and finally 0.6122 with the full hybrid setup—a 31.9% increase overall.

Median score jumped from 0.4665 to 0.6827, confirming the overall shift toward higher confidence.

The proportion of high-confidence predictions (≥ 0.5) also showed striking gains:

CNN: 41.18%
CNN+Transformer: 64.71%
CNN+Transformer+MFE: 82.35%

This means that with the final architecture, most predictions are not only correct but confidently correct.

You might notice a slight increase in standard deviation (from 0.1237 to 0.1616), which might seem like a negative at first. But in reality, this reflects a more nuanced response to input complexity:

The model is highly confident on easier samples, and appropriately cautious on harder ones. The improvement in maximum confidence value (from 0.6343 to 0.7746) further shows how this hybrid architecture can make more decisive and assured judgments when presented with straightforward samples.

5.2 Heatmap Analysis: Tracing the Evolution of Model Reasoning

While statistical metrics are helpful, they don’t tell the full story.
To truly understand how the model makes decisions, we need to see what it sees and heatmaps make this possible.

In these heatmaps, red indicates areas of high attention, highlighting the regions the model relies on most during prediction. By analyzing these attention maps, we can observe how each model interprets visual information, revealing fundamental differences in their reasoning styles.

Let’s walk through one representative case.

5.2.1 Frontal View of a Border Collie: From Local Eye Focus to Structured Morphological Understanding
When presented with a frontal image of a Border Collie, the three models reveal distinct attention patterns, reflecting how their architectural designs shape visual understanding.

The CNN-only model produces a heatmap with two sharp attention peaks, both centered on the dog’s eyes. This indicates a strong reliance on local features while overlooking other morphological traits like the ears or facial outline. While eyes are indeed important, focusing solely on them makes the model more vulnerable to variations in pose or lighting. The resulting confidence score of 0.5581 reflects this limitation.

With the CNN+Transformer model, the attention becomes more distributed. The heatmap forms a loose M-shaped pattern, extending beyond the eyes to include the forehead and the space between the eyes. This shift suggests that the model begins to understand spatial relationships between features, not just the features themselves. This added contextual awareness leads to a stronger confidence score of 0.6559.

The CNN+Transformer+MFE model shows the most structured and comprehensive attention map. The heat is symmetrically distributed across the eyes, ears, and the broader facial region. This indicates that the model has moved beyond feature detection and is now capturing how features are arranged as part of a meaningful whole. The Morphological Feature Extractor plays a key role here, helping the model grasp the structural signature of the breed. This deeper understanding boosts the confidence to 0.6972.

Together, these three heatmaps represent a clear progression in visual reasoning, from isolated feature detection, to inter-feature context, and finally to structural interpretation. Even though ConvNeXtV2 is already a powerful backbone, adding Transformer and MFE modules enables the model to not just see features but to understand them as part of a coherent morphological pattern. This shift is subtle but crucial, especially for fine-grained tasks like breed classification.

5.2.2 Error Case Analysis: From Misclassification to True Understanding

This is a case where the CNN-only model misclassified a Border Collie.

Looking at the heatmap, we can see why. The model focuses almost entirely on a single eye, ignoring most of the face. This kind of over-reliance on one local feature makes it easy to confuse breeds that share similar traits in this case, a Collie, which also has similar eye shape and color contrast.

What the model misses are the broader facial proportions and structural details that define a Border Collie. Its low confidence score of 0.2492 reflects that uncertainty.

With the CNN+Transformer model, attention shifts in a more promising direction. It now covers both eyes and parts of the forehead, creating a more balanced attention pattern. This suggests the model is beginning to connect multiple features, rather than depending on just one.

Thanks to self-attention, it can better interpret relationships between facial components, leading to the correct prediction — Border Collie. The confidence score rises to 0.5484, more than double the previous model’s.

The CNN+Transformer+MFE model takes this further by improving morphological awareness. The heatmap now extends to the nose and muzzle, capturing nuanced traits like facial length and mouth shape. These are subtle but important cues that help distinguish herding breeds from one another.

The MFE module seems to guide the model toward structural combinations, not just isolated features. As a result, confidence increases again to 0.5693, showing a more stable, breed-specific understanding.

This progression from a narrow focus on a single eye, to integrating facial traits, and finally to interpreting structural morphology, highlights how hybrid models support more accurate and generalizable visual reasoning.

In this example, the CNN-only model focuses almost entirely on one side of the dog’s face. The rest of the image is nearly ignored. This kind of narrow attention suggests the model didn’t have enough visual context to make a strong decision. It guessed correctly this time, but with a low confidence score of 0.2238, it’s clear that the prediction wasn’t based on solid reasoning.

The CNN+Transformer model shows a broader attention span, but it introduces a different issue, the heatmap becomes scattered. You can even spot a strong attention spike on the far right, completely unrelated to the dog. This kind of misplaced focus likely led to a misclassification as a Shiba Inu, and the confidence score was still low at 0.2305.

This highlights an important point:

Adding a Transformer doesn’t guarantee better judgment unless the model learns where to look. Without guidance, self-attention can amplify the wrong signals and create confusion rather than clarity.

With the CNN+Transformer+MFE model, the attention becomes more focused and structured. The model now looks at key regions like the eyes, nose, and chest, building a more meaningful understanding of the image. But even here, the confidence remains low at 0.1835, despite the correct prediction. This image clearly presented a real challenge for all three models.

That’s what makes this case so interesting.

It reminds us that a correct prediction doesn’t always mean the model was confident. In harder scenarios unusual poses, subtle features, cluttered backgrounds even the most advanced models can hesitate.

And that’s where confidence scores become invaluable.
They help flag uncertain cases, making it easier to design review pipelines where human experts can step in and verify tricky predictions.

5.2.3 Recognizing Artistic Renderings: Testing the Limits of Generalization

Artistic images pose a unique challenge for visual recognition systems. Unlike standard photos with crisp textures and clear lighting, painted artworks are often abstract and distorted. This forces models to rely less on superficial cues and more on deeper, structural understanding. In that sense, they serve as a perfect stress test for generalization.

Let’s see how the three models handle this scenario.

Starting with the CNN-only model, the attention map is scattered, with focus diffused across both sides of the image. There’s no clear structure — just a vague attempt to “see everything,” which usually means the model is unsure what to focus on. That uncertainty is reflected in its confidence score of 0.5394, sitting in the lower-mid range. The model makes the correct guess, but it’s far from confident.

Next, the CNN+Transformer model shows a clear improvement. Its attention sharpens and clusters around more meaningful regions, particularly near the eyes and ears. Even with the stylized brushstrokes, the model seems to infer, “this could be an ear” or “that looks like the facial outline.” It’s starting to map anatomical cues, not just visual textures. The confidence score rises to 0.6977, suggesting a more structured understanding is taking shape.

Finally, we look at the CNN+Transformer+MFE hybrid model. This one locks in with precision. The heatmap centers tightly on the intersection of the eyes and nose — arguably the most distinctive and stable region for identifying a Border Collie, even in abstract form. It’s no longer guessing based on appearance. It’s reading the dog’s underlying structure.

This leap is largely thanks to the MFE, which helps the model focus on features that persist, even when style or detail varies. The result? A confident score of 0.7457, the highest among all three.

This experiment makes something clear:

Hybrid models don’t just get better at recognition, they get better at reasoning.

They learn to look past visual noise and focus on what matters most: structure, proportion, and pattern. And that’s what makes them reliable, especially in the unpredictable, messy real world of images.

Conclusion

As deep learning evolves, we’ve moved from CNNs to Transformers—and now toward hybrid architectures that combine the best of both. This shift reflects a broader change in AI design philosophy: from seeking purity to embracing fusion.

Think of it like cooking. Great chefs don’t insist on one technique. They mix sautéing, boiling, and frying depending on the ingredient. Similarly, hybrid models combine different architectural “flavors” to suit the task at hand.

This fusion design offers several key benefits:

Complementary strengths: Like combining a microscope and a telescope, hybrid models capture both fine details and global context.
Structured understanding: Morphological feature extractors bring expert-level domain insights, allowing models not just to see, but to truly understand.
Dynamic adaptability: Future models might adjust internal attention patterns based on the image, emphasizing texture for spotted breeds, or structure for solid-colored ones.
Wider applicability: From medical imaging to biodiversity and art authentication, any task involving fine-grained visual distinctions can benefit from this approach.

This visual system—blending ConvNeXtV2, attention mechanisms, and morphological reasoning proves that accuracy and intelligence don’t come from any single architecture, but from the right combination of ideas.

Perhaps the future of AI won’t rely on one perfect design, but on learning to combine cognitive strategies just as the human brain does.

References & Data Source

Research References

Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
Dosovitskiy, A., et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Liu, Z., et al. (2022). ConvNeXt: A ConvNet for the 2020s. CVPR 2022
Liu, Z., et al. (2023). ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. CVPR 2023.
Rockt (2018). Einstein Summation Notation Explained Visually. rockt.github.io
Pytorch Org. torch.einsum

Dataset Sources

Stanford Dogs Dataset – Kaggle Dataset
Originally sourced from Stanford Vision Lab – ImageNet Dogs License: Non-commercial research and educational use only Citation: Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for Fine-Grained Image Categorization. FGVC Workshop, CVPR, 2011
Unsplash Images – Additional images of four breeds (Bichon Frise, Dachshund, Shiba Inu, Havanese) were sourced from Unsplash for dataset augmentation.

Thank you for reading. Through developing PawMatchAI, I’ve learned many valuable lessons about AI vision systems and feature recognition. If you have any perspectives or topics you’d like to discuss, I welcome the opportunity to exchange ideas.
Email
GitHub

Disclaimer

The methods and approaches described in this article are based on my personal research and experimental findings. While the Hybrid Architecture has demonstrated improvements in specific scenarios, its performance may vary depending on datasets, implementation details, and training conditions.

This article is intended for educational and informational purposes only. Readers should conduct independent evaluations and adapt the approach based on their specific use cases. No guarantees are made regarding its effectiveness across all applications.

The post The Art of Hybrid Architectures appeared first on Towards Data Science.

Master the 3D Reconstruction Process: A Step-by-Step Guide

Florent Poux, Ph.D. — Fri, 28 Mar 2025 20:25:57 +0000

The 3d Reconstruction journey from 2D photographs to 3D models follows a structured path.

This path consists of distinct steps that build upon each other to transform flat images into spatial information.

Understanding this pipeline is crucial for anyone looking to create high-quality 3D reconstructions.

Let me explain…

Most people think 3D reconstruction means:

Taking random photos around an object
Pressing a button in expensive software
Waiting for magic to happen
Getting perfect results every time
Skipping the fundamentals

No thanks.

The most successful 3D Reconstruction I have seen are built on three core principles:

They use pipelines that work with fewer images but position them better.
They make sure users spend less time processing but achieve cleaner results.
They permit troubleshooting faster because users know exactly where to look.

Therefore, this hints at a nice lesson:

Your 3D models can only be as good as your understanding of how they’re created.

Looking at this from a scientific perspective is really key.

Let us dive right into it!

If you are new to my (3D) writing world, welcome! We are going on an exciting adventure that will allow you to master an essential 3D Python skill.

Once the scene is laid out, we embark on the Python journey. Everything is provided, included resources at the end. You will see Tips (Notes and Growing) to help you get the most out of this article. Thanks to the 3D Geodata Academy for supporting the endeavor. This article is inspired by a small section of Module 1 of the 3D Reconstructor OS Course.

The Complete 3D Reconstruction Workflow

Let me highlight the 3D Reconstruction pipeline with Photogrammetry. The process follows a logical sequence of steps, as illustrated below.

What is important to note, is that each step builds upon the previous one. Therefore, the quality of each stage directly impacts the final result, which is very important to have in mind!

Understanding the entire process is crucial for troubleshooting workflows due to its sequential nature.

With that in mind, let’s detail each step, focusing on both the theory and practical implementation.

Natural Feature Extraction: Finding the Distinctive Points

Natural feature extraction is the foundation of the photogrammetry process. It identifies distinctive points in images that can be reliably located across multiple photographs.

These points serve as anchors that tie different views together.

When working with low-texture objects, consider adding temporary markers or texture patterns to improve feature extraction results.

Common feature extraction algorithms include:

Algorithm	Strengths	Weaknesses	Best For
SIFT	Scale and rotation invariant	Computationally expensive	High-quality, general-purpose reconstruction
SURF	Faster than SIFT	Less accurate than SIFT	Quick prototyping
ORB	Very fast, no patent restrictions	Less robust to viewpoint changes	Real-time applications

Let’s implement a simple feature extraction using OpenCV:

#%% SECTION 1: Natural Feature Extraction
import cv2
import numpy as np
import matplotlib.pyplot as plt

def extract_features(image_path, feature_method='sift', max_features=2000):
    """
    Extract features from an image using different methods.
    """

    # Read the image in color and convert to grayscale
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError(f"Could not read image at {image_path}")
    
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
    # Initialize feature detector based on method
    if feature_method.lower() == 'sift':
        detector = cv2.SIFT_create(nfeatures=max_features)
    elif feature_method.lower() == 'surf':
        # Note: SURF is patented and may not be available in all OpenCV distributions
        detector = cv2.xfeatures2d.SURF_create(400)  # Adjust threshold as needed
    elif feature_method.lower() == 'orb':
        detector = cv2.ORB_create(nfeatures=max_features)
    else:
        raise ValueError(f"Unsupported feature method: {feature_method}")
    
    # Detect and compute keypoints and descriptors
    keypoints, descriptors = detector.detectAndCompute(gray, None)
    
    # Create visualization
    img_with_features = cv2.drawKeypoints(
        img, keypoints, None, 
        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS
    )
    
    print(f"Extracted {len(keypoints)} {feature_method.upper()} features")
    
    return keypoints, descriptors, img_with_features

image_path = "sample_image.jpg"  # Replace with your image path

# Extract features with different methods
kp_sift, desc_sift, vis_sift = extract_features(image_path, 'sift')
kp_orb, desc_orb, vis_orb = extract_features(image_path, 'orb')

What I do here is run through an image, and hunt for distinctive patterns that stand out from their surroundings.

These patterns create mathematical “signatures” called descriptors that remain recognizable even when viewed from different angles or distances.

Think of them as unique fingerprints that can be matched across multiple photographs.

The visualization step reveals exactly what the algorithm finds important in your image.

# Display results
plt.figure(figsize=(12, 6))
    
plt.subplot(1, 2, 1)
plt.title(f'SIFT Features ({len(kp_sift)})')
plt.imshow(cv2.cvtColor(vis_sift, cv2.COLOR_BGR2RGB))
plt.axis('off')
    
plt.subplot(1, 2, 2)
plt.title(f'ORB Features ({len(kp_orb)})')
plt.imshow(cv2.cvtColor(vis_orb, cv2.COLOR_BGR2RGB))
plt.axis('off')
    
plt.tight_layout()
plt.show()

Notice how corners, edges, and textured areas attract more keypoints, while smooth or uniform regions remain largely ignored.

This visual feedback is invaluable for understanding why some objects reconstruct better than others.

Geeky Note: The max_features parameter is critical. Setting it too high can dramatically slow processing and capture noise, while setting it too low might miss important details. For most objects, 2000-5000 features provide a good balance, but I’ll push it to 10,000+ for highly detailed architectural reconstructions.

Feature Matching: Connecting Images Together

Once features are extracted, the next step is to find correspondences between images. This process identifies which points in different images represent the same physical point in the real world. Feature matching creates the connections needed to determine camera positions.

I’ve seen countless attempts fail because the algorithm couldn’t reliably connect the same points across different images.

The ratio test is the silent hero that weeds out ambiguous matches before they poison your reconstruction.

#%% SECTION 2: Feature Matching
import cv2
import numpy as np
import matplotlib.pyplot as plt

def match_features(descriptors1, descriptors2, method='flann', ratio_thresh=0.75):
    """
    Match features between two images using different methods.
    """

    # Convert descriptors to appropriate type if needed
    if descriptors1 is None or descriptors2 is None:
        return []
    
    if method.lower() == 'flann':
        # FLANN parameters
        if descriptors1.dtype != np.float32:
            descriptors1 = np.float32(descriptors1)
        if descriptors2.dtype != np.float32:
            descriptors2 = np.float32(descriptors2)
            
        FLANN_INDEX_KDTREE = 1
        index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
        search_params = dict(checks=50)  # Higher values = more accurate but slower
        
        flann = cv2.FlannBasedMatcher(index_params, search_params)
        matches = flann.knnMatch(descriptors1, descriptors2, k=2)
    else:  # Brute Force
        # For ORB descriptors
        if descriptors1.dtype == np.uint8:
            bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False)
        else:  # For SIFT and SURF descriptors
            bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=False)
        
        matches = bf.knnMatch(descriptors1, descriptors2, k=2)
    
    # Apply Lowe's ratio test
    good_matches = []
    for match in matches:
        if len(match) == 2:  # Sometimes fewer than 2 matches are returned
            m, n = match
            if m.distance < ratio_thresh * n.distance:
                good_matches.append(m)
    
    return good_matches

def visualize_matches(img1, kp1, img2, kp2, matches, max_display=100):
    """
    Create a visualization of feature matches between two images.
    """

    # Limit the number of matches to display
    matches_to_draw = matches[:min(max_display, len(matches))]
    
    # Create match visualization
    match_img = cv2.drawMatches(
        img1, kp1, img2, kp2, matches_to_draw, None,
        flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
    )
    
    return match_img

# Load two images
img1_path = "image1.jpg"  # Replace with your image paths
img2_path = "image2.jpg"
    
# Extract features using SIFT (or your preferred method)
kp1, desc1, _ = extract_features(img1_path, 'sift')
kp2, desc2, _ = extract_features(img2_path, 'sift')
    
# Match features
good_matches = match_features(desc1, desc2, method='flann')
    
print(f"Found {len(good_matches)} good matches")

The matching process works by comparing feature descriptors between two images, measuring their mathematical similarity. For each feature in the first image, we find its two closest matches in the second image and assess their relative distances.

If the closest match is significantly better than the second-best (as controlled by the ratio threshold), we consider it reliable.

# Visualize matches
img1 = cv2.imread(img1_path)
img2 = cv2.imread(img2_path)
match_visualization = visualize_matches(img1, kp1, img2, kp2, good_matches)
    
plt.figure(figsize=(12, 8))
plt.imshow(cv2.cvtColor(match_visualization, cv2.COLOR_BGR2RGB))
plt.title(f"Feature Matches: {len(good_matches)}")
plt.axis('off')
plt.tight_layout()
plt.show()

Visualizing these matches reveals the spatial relationships between your images.

Good matches form a consistent pattern that reflects the transform between viewpoints, while outliers appear as random connections.

This pattern provides immediate feedback on image quality and camera positioning—clustered, consistent matches suggest good reconstruction potential.

Geeky Note: The ratio_thresh parameter (0.75) is Lowe’s original recommendation and works well in most situations. Lower values (0.6-0.7) produce fewer but more reliable matches, which is preferable for scenes with repetitive patterns. Higher values (0.8-0.9) yield more matches but increase the risk of outliers contaminating your reconstruction.

Beautiful, now, let us move at the main stage: the Structure from Motion node.

Structure From Motion: Placing Cameras in Space

Structure from Motion (SfM) reconstructs both the 3D scene structure and camera motion from the 2D image correspondences. This process determines where each photo was taken from and creates an initial sparse point cloud of the scene.

Key steps in SfM include:

Estimating the fundamental or essential matrix between image pairs
Recovering camera poses (position and orientation)
Triangulating 3D points from 2D correspondences
Building a track graph to connect observations across multiple images

The essential matrix encodes the geometric relationship between two camera viewpoints, revealing how they’re positioned relative to each other in space.

This mathematical relationship is the foundation for reconstructing both the camera positions and the 3D structure they observed.

#%% SECTION 3: Structure from Motion
import cv2
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def estimate_pose(kp1, kp2, matches, K, method=cv2.RANSAC, prob=0.999, threshold=1.0):
    """
    Estimate the relative pose between two cameras using matched features.
    """

    # Extract matched points
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    
    # Estimate essential matrix
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method, prob, threshold)
    
    # Recover pose from essential matrix
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    
    inlier_matches = [matches[i] for i in range(len(matches)) if mask[i] > 0]
    print(f"Estimated pose with {np.sum(mask)} inliers out of {len(matches)} matches")
    
    return R, t, mask, inlier_matches

def triangulate_points(kp1, kp2, matches, K, R1, t1, R2, t2):
    """
    Triangulate 3D points from two views.
    """

    # Extract matched points
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    
    # Create projection matrices
    P1 = np.dot(K, np.hstack((R1, t1)))
    P2 = np.dot(K, np.hstack((R2, t2)))
    
    # Triangulate points
    points_4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    
    # Convert to 3D points
    points_3d = points_4d[:3] / points_4d[3]
    
    return points_3d.T

def visualize_points_and_cameras(points_3d, R1, t1, R2, t2):
    """
    Visualize 3D points and camera positions.
    """

    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    
    # Plot points
    ax.scatter(points_3d[:, 0], points_3d[:, 1], points_3d[:, 2], c='b', s=1)
    
    # Helper function to create camera visualization
    def plot_camera(R, t, color):
        # Camera center
        center = -R.T @ t
        ax.scatter(center[0], center[1], center[2], c=color, s=100, marker='o')
        
        # Camera axes (showing orientation)
        axes_length = 0.5  # Scale to make it visible
        for i, c in zip(range(3), ['r', 'g', 'b']):
            axis = R.T[:, i] * axes_length
            ax.quiver(center[0], center[1], center[2], 
                      axis[0], axis[1], axis[2], 
                      color=c, arrow_length_ratio=0.1)
    
    # Plot cameras
    plot_camera(R1, t1, 'red')
    plot_camera(R2, t2, 'green')
    
    ax.set_title('3D Reconstruction: Points and Cameras')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    
    # Try to make axes equal
    max_range = np.max([
        np.max(points_3d[:, 0]) - np.min(points_3d[:, 0]),
        np.max(points_3d[:, 1]) - np.min(points_3d[:, 1]),
        np.max(points_3d[:, 2]) - np.min(points_3d[:, 2])
    ])
    
    mid_x = (np.max(points_3d[:, 0]) + np.min(points_3d[:, 0])) * 0.5
    mid_y = (np.max(points_3d[:, 1]) + np.min(points_3d[:, 1])) * 0.5
    mid_z = (np.max(points_3d[:, 2]) + np.min(points_3d[:, 2])) * 0.5
    
    ax.set_xlim(mid_x - max_range * 0.5, mid_x + max_range * 0.5)
    ax.set_ylim(mid_y - max_range * 0.5, mid_y + max_range * 0.5)
    ax.set_zlim(mid_z - max_range * 0.5, mid_z + max_range * 0.5)
    
    plt.tight_layout()
    plt.show()

Geeky Note: The RANSAC threshold parameter (threshold=1.0) determines how strict we are about geometric consistency. I’ve found that 0.5-1.0 works well for controlled environments, but increasing to 1.5-2.0 helps with outdoor scenes where wind might cause slight camera movements. The probability parameter (prob=0.999) ensures high confidence but increases computation time; 0.95 is sufficient for prototyping.

The essential matrix estimation uses matched feature points and the camera’s internal parameters to calculate the geometric relationship between images.

This relationship is then decomposed to extract rotation and translation information – essentially determining where each photo was taken from in 3D space. The accuracy of this step directly affects everything that follows.


# This is a simplified example - in practice you would use images and matches
# from the previous steps
    
# Example camera intrinsic matrix (replace with your calibrated values)
K = np.array([
        [1000, 0, 320],
        [0, 1000, 240],
        [0, 0, 1]
])
    
# For first camera, we use identity rotation and zero translation
R1 = np.eye(3)
t1 = np.zeros((3, 1))
    
# Load images, extract features, and match as in previous sections
img1_path = "image1.jpg"  # Replace with your image paths
img2_path = "image2.jpg"
    
img1 = cv2.imread(img1_path)
img2 = cv2.imread(img2_path)
    
kp1, desc1, _ = extract_features(img1_path, 'sift')
kp2, desc2, _ = extract_features(img2_path, 'sift')
    
matches = match_features(desc1, desc2, method='flann')
    
# Estimate pose of second camera relative to first
R2, t2, mask, inliers = estimate_pose(kp1, kp2, matches, K)
    
# Triangulate points
points_3d = triangulate_points(kp1, kp2, inliers, K, R1, t1, R2, t2)

Once camera positions are established, triangulation projects rays from matched points in multiple images to determine where they intersect in 3D space.

# Visualize the result
visualize_points_and_cameras(points_3d, R1, t1, R2, t2)

These intersections form the initial sparse point cloud, providing the skeleton upon which dense reconstruction will later build. The visualization shows both the reconstructed points and the camera positions, helping you understand the spatial relationships in your dataset.

SfM works best with a good network of overlapping images. Aim for at least 60% overlap between adjacent images for reliable reconstruction.

Bundle Adjustment: Optimizing for Accuracy

There is an extra optimization stage that comes in within the Structure from Motion “compute node”.

This is called: Bundle adjustment.

It is a refinement step that jointly optimizes camera parameters and 3D point positions. What that means, is that it minimizes the reprojection error, i.e. the difference between observed image points and the projection of their corresponding 3D points.

Does this make sense to you? Essentially, this optimization is great as it permits to:

improves the accuracy of the reconstruction
correct for accumulated drift
Ensures global consistency of the model

At this stage, this should be enough to get a good intuition of how it works.

In larger projects, incremental bundle adjustment (optimizing after adding each new camera) can improve both speed and stability compared to global adjustment at the end.

Dense Matching: Creating Detailed Reconstructions

After establishing camera positions and sparse points, the final step is dense matching to create a detailed representation of the scene.

Dense matching uses the known camera parameters to match many more points between images, resulting in a complete point cloud.

Common approaches include:

Multi-View Stereo (MVS)
Patch-based Multi-View Stereo (PMVS)
Semi-Global Matching (SGM)

Putting It All Together: Practical Tools

The theoretical pipeline is implemented in several open-source and commercial software packages. Each offers different features and capabilities:

Tool	Strengths	Use Case	Pricing
COLMAP	Highly accurate, customizable	Research, precise reconstructions	Free, open-source
OpenMVG	Modular, extensive documentation	Education, integration with custom pipelines	Free, open-source
Meshroom	User-friendly, node-based interface	Artists, beginners	Free, open-source
RealityCapture	Extremely fast, high-quality results	Professional, large-scale projects	Commercial

These tools package the various pipeline steps described above into a more user-friendly interface, but understanding the underlying processes is still essential for troubleshooting and optimization.

Automating the reconstruction pipeline saves countless hours of manual work.

The real productivity boost comes from scripting the entire process end-to-end, from raw photos to dense point cloud.

COLMAP’s command-line interface makes this automation possible, even for complex reconstruction tasks.

#%% SECTION 4: Complete Pipeline Automation with COLMAP
import os
import subprocess
import glob
import numpy as np

def run_colmap_pipeline(image_folder, output_folder, colmap_path="colmap"):
    """
    Run the complete COLMAP pipeline from feature extraction to dense reconstruction.
    """

    # Create output directories if they don't exist
    sparse_folder = os.path.join(output_folder, "sparse")
    dense_folder = os.path.join(output_folder, "dense")
    database_path = os.path.join(output_folder, "database.db")
    
    os.makedirs(output_folder, exist_ok=True)
    os.makedirs(sparse_folder, exist_ok=True)
    os.makedirs(dense_folder, exist_ok=True)
    
    # Step 1: Feature extraction
    print("Step 1: Feature extraction")
    feature_cmd = [
        colmap_path, "feature_extractor",
        "--database_path", database_path,
        "--image_path", image_folder,
        "--ImageReader.camera_model", "SIMPLE_RADIAL",
        "--ImageReader.single_camera", "1",
        "--SiftExtraction.use_gpu", "1"
    ]
    
    try:
        subprocess.run(feature_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Feature extraction failed: {e}")
        return False
    
    # Step 2: Match features
    print("Step 2: Feature matching")
    match_cmd = [
        colmap_path, "exhaustive_matcher",
        "--database_path", database_path,
        "--SiftMatching.use_gpu", "1"
    ]
    
    try:
        subprocess.run(match_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Feature matching failed: {e}")
        return False
    
    # Step 3: Sparse reconstruction (Structure from Motion)
    print("Step 3: Sparse reconstruction")
    sfm_cmd = [
        colmap_path, "mapper",
        "--database_path", database_path,
        "--image_path", image_folder,
        "--output_path", sparse_folder
    ]
    
    try:
        subprocess.run(sfm_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Sparse reconstruction failed: {e}")
        return False
    
    # Find the largest sparse model
    sparse_models = glob.glob(os.path.join(sparse_folder, "*/"))
    if not sparse_models:
        print("No sparse models found")
        return False
    
    # Sort by model size (using number of images as proxy)
    largest_model = 0
    max_images = 0
    for i, model_dir in enumerate(sparse_models):
        images_txt = os.path.join(model_dir, "images.txt")
        if os.path.exists(images_txt):
            with open(images_txt, 'r') as f:
                num_images = sum(1 for line in f if line.strip() and not line.startswith("#"))
                num_images = num_images // 2  # Each image has 2 lines
                if num_images > max_images:
                    max_images = num_images
                    largest_model = i
    
    selected_model = os.path.join(sparse_folder, str(largest_model))
    print(f"Selected model {largest_model} with {max_images} images")
    
    # Step 4: Image undistortion
    print("Step 4: Image undistortion")
    undistort_cmd = [
        colmap_path, "image_undistorter",
        "--image_path", image_folder,
        "--input_path", selected_model,
        "--output_path", dense_folder,
        "--output_type", "COLMAP"
    ]
    
    try:
        subprocess.run(undistort_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Image undistortion failed: {e}")
        return False
    
    # Step 5: Dense reconstruction (Multi-View Stereo)
    print("Step 5: Dense reconstruction")
    mvs_cmd = [
        colmap_path, "patch_match_stereo",
        "--workspace_path", dense_folder,
        "--workspace_format", "COLMAP",
        "--PatchMatchStereo.geom_consistency", "true"
    ]
    
    try:
        subprocess.run(mvs_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Dense reconstruction failed: {e}")
        return False
    
    # Step 6: Stereo fusion
    print("Step 6: Stereo fusion")
    fusion_cmd = [
        colmap_path, "stereo_fusion",
        "--workspace_path", dense_folder,
        "--workspace_format", "COLMAP",
        "--input_type", "geometric",
        "--output_path", os.path.join(dense_folder, "fused.ply")
    ]
    
    try:
        subprocess.run(fusion_cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Stereo fusion failed: {e}")
        return False
    
    print("Pipeline completed successfully!")
    return True

The script orchestrates a series of COLMAP operations that would normally require manual intervention at each stage. It handles the progression from feature extraction through matching, sparse reconstruction, and finally dense reconstruction – maintaining the correct data flow between steps. This automation becomes invaluable when processing multiple datasets or when iteratively refining reconstruction parameters.

# Replace with your image and output folder paths
image_folder = "path/to/images"
output_folder = "path/to/output"
    
# Path to COLMAP executable (may be just "colmap" if it's in your PATH)
colmap_path = "colmap"
    
run_colmap_pipeline(image_folder, output_folder, colmap_path)

One key aspect is the automatic selection of the largest reconstructed model. In challenging datasets, COLMAP sometimes creates multiple disconnected reconstructions rather than a single cohesive model.

The script intelligently identifies and continues with the most complete reconstruction, using image count as a proxy for model quality and completeness.

Geeky Note: The –SiftExtraction.use_gpu and –SiftMatching.use_gpu flags enable GPU acceleration, speeding up processing by 5-10x. For dense reconstruction, the –PatchMatchStereo.geom_consistency true parameter significantly improves quality by enforcing consistency across multiple views, at the cost of longer processing time.

The Power of Understanding the Pipeline

Understanding the full reconstruction pipeline gives you control over your 3D modeling process. When you encounter issues, knowing which stage might be causing problems allows you to target your troubleshooting efforts effectively.

As illustrated, common issues and their sources include:

Missing or incorrect camera poses: Feature extraction and matching problems
Incomplete reconstruction: Insufficient image overlap
Noisy point clouds: Poor bundle adjustment or camera calibration
Failed reconstruction: Problematic images (motion blur, poor lighting)

The ability to diagnose these issues comes from a deep understanding of how each pipeline component works and interacts with others.

Next Steps: Practice and Automation

Now that you understand the pipeline, it’s time to put it into practice. Experiment with the provided code examples and try automating the process for your own datasets.

Start with small, well-controlled scenes and gradually tackle more complex environments as you gain confidence.

Remember that the quality of your input images dramatically affects the final result. Take time to capture high-quality photographs with good overlap, consistent lighting, and minimal motion blur.

Consider starting a small personal project to reconstruct an object you own. Document your process, including the issues you encounter and how you solve them – this practical experience is invaluable.

If you want to build proper expertise, consider
the 3D Reconstructor OS Course ,
or 3D Data Science with Python (O’Reilly)

References and useful resources

I compiled for you some interesting software, tools, and useful algorithm extended documentation:

Software and Tools

COLMAP – Free, open-source 3D reconstruction software
OpenMVG – Open Multiple View Geometry library
Meshroom – Free node-based photogrammetry software
RealityCapture – Commercial high-performance photogrammetry software
Agisoft Metashape – Commercial photogrammetry and 3D modeling software
OpenCV – Computer vision library with feature detection implementations
3DF Zephyr – Photogrammetry software for 3D reconstruction
Python – Programming language ideal for 3D reconstruction automation

Algorithms

SIFT (Scale-Invariant Feature Transform) – Robust feature detection algorithm
SURF (Speeded-Up Robust Features) – Fast feature detection algorithm
ORB (Oriented FAST and Rotated BRIEF) – Efficient alternative to SIFT and SURF
RANSAC (Random Sample Consensus) – Used for outlier rejection in matching
Structure from Motion (SfM) – Algorithm for recovering 3D structure from 2D images
Multi-View Stereo (MVS) – Dense reconstruction algorithm
Bundle Adjustment – Optimization technique for camera poses and 3D points
FLANN (Fast Library for Approximate Nearest Neighbors) – Fast matching algorithm for feature descriptors

About the author

Florent Poux, Ph.D. is a Scientific and Course Director focused on educating engineers on leveraging AI and 3D Data Science. He leads research teams and teaches 3D Computer Vision at various universities. His current aim is to ensure humans are correctly equipped with the knowledge and skills to tackle 3D challenges for impactful innovations.

Resources

The post Master the 3D Reconstruction Process: A Step-by-Step Guide appeared first on Towards Data Science.

Automate Supply Chain Analytics Workflows with AI Agents using n8n

Samir Saci — Wed, 26 Mar 2025 19:48:42 +0000

Why build things the hard way when you can design them the smart way?

As a Supply Chain Data Scientist, I’ve explored various frameworks like LangChain and LangGraph to build AI agents using Python.

Leveraging LLMs with LangChain for Supply Chain Analytics — A Control Tower Powered by GPT — (Image by Samir Saci)

The illustration above is from an article I wrote at the end of 2023, titled “Leveraging LLMs with LangChain for Supply Chain Analytics — A Control Tower Powered by GPT.”

Leveraging LLMs with LangChain for Supply Chain Analytics – A Control Tower Powered by GPT

At the time, I was exploring how to use LangChain to build an agent acting as a Supply Chain Control Tower.

A year later, I discovered the power of the low-code platform n8n to build the same kind of solution in just a few clicks.

AI-Powered Email Parser used for the processing of Warehouse Orders received by Email — (Image by Samir Saci)

In this article, we’ll explore how to easily build AI agents to automate supply chain analytics workflows using n8n.

AI Agent for Supply Chain Control Tower — (Image by Samir Saci)

We’ll also see how to deploy the same AI-powered Control Tower agent I originally built with LangChain 18 months ago — now using only low-code.

AI Agent for Supply Chain Control Towers using LangChain

My first project of AI Automation project using n8n was for a customer who wanted a Supply Chain Control Tower equipped with a chat interface.

A Supply Chain Control Tower is a set of dashboards and reports connected to Warehouse and Transport Management Systems that use data to monitor critical events across the supply chain.

Example of a control

In an earlier article published on Towards Data Science, I experimented with LangChain to connect a control tower to an AI agent.

High-Level Overview of the Solution presented in the article — (Image by Samir Saci)

The idea was to build a plan-and-execute agent that would

Process the user’s request written in plain English
Generate the appropriate SQL query
Query the database and store the results
Formulate a clear response in plain English

After several iterations, I found the right chain structure and prompts to deliver accurate results.

Example of iterations that you can find in the article — (Image by Samir Saci)

The solution worked well because I had already gained experience using LangChain and other frameworks to build AI agents.

How are we supposed to maintain this complex setup?

However, to offer this as a service, I needed tools that would make the solution easier to deploy, maintain, and improve — even with limited Python knowledge.

That’s when I discovered n8n.

Let’s dive into that in the next section.

AI Agent for Supply Chain Control Towers — Built with n8n

What is n8n?

n8n is an open-source workflow automation tool that lets you easily connect apps (email, CRMs, messaging systems), APIs, and AI model frameworks like LangChain.

You build workflows by connecting pre-built nodes.

AI-Powered Email Parser using 4 nodes — (Image by Samir Saci)

For instance, the workflow above processes emails

The first node collects emails from a Gmail account.
The email content and metadata are sent to the AI Agent node, which extracts the relevant information.
The third node processes the output using JavaScript.
The final node loads the results into a Google Sheet.

No code was needed to build this workflow — except for the third node, which uses just two lines of JavaScript.

Since I work with a team of Supply Chain consultants who have limited Python skills, this was a game-changer for me as I looked to develop my service offering.

They can easily use, adapt, and maintain this workflow after a short training session on n8n.

AI Supply Chain Control Tower n8n workflow

The AI Supply Chain Control Tower workflow is a bit more complex — but still far simpler than its Python version.

It includes two sub-workflows.

Main sub-workflow including the AI agent — (Image by Samir Saci)

The main sub-workflow includes both a chat interface and the AI agent.

For the AI Agent node, you need to

Connect an LLM (chat model) using a node where you enter your API credentials
Add a memory node to manage the conversation
Add a tool node for SQL querying, linked to the second sub-workflow

The AI agent generates an SQL query and sends it to the “Call Query Tool” node, which executes the query.

Second sub-workflow connected via the “Call Query Tool” — (Image by Samir Saci)

The sub-workflow includes a code node that cleans the query (removing extra spaces and blocking risky commands like DELETE).

The output is sent to a BigQuery node, which runs the query and returns the results.

The process is very smooth and requires limited configuration:

System Prompt (in the AI Agent node)
User Prompt (in the AI Agent Node)

System Prompt Window of the AI Agent Node — (Image by Samir Saci)

This setup requires no Python skills and can be handled directly by my consultants.

Chat Window showing an interraction with the AI Agent — (Image by Samir Saci)

The results are comparable to those of the Python version.

For step-by-step setup instructions, check out my YouTube tutorial

Conclusion

This example shows how easy it is to replicate an AI agent built with Python — using n8n and minimal code.

Does that mean Python is no longer needed for Supply Chain Analytics? Definitely not!

Like many low-code platforms, the features are limited to what is available within the framework.

That’s why I use it as a complement to my analytics products.

Connect an AI Agent with one of my analytics products’ backend using an HTTP node — (Image by Samir Saci)

To do that, you can use the HTTP Request node to connect your workflow to your analytics backend.

What else? Easy connectivity to many services.

Another reason I chose n8n to enrich my analytics products is how easy it is to add additional connections.

For example, if you want to add a Slack interface or log conversations to a Google Sheet, just add a new node to your workflow.

If you’re starting your n8n journey and need inspiration, feel free to explore my templates.

About Me

Let’s connect on Linkedin and Twitter; I am a Supply Chain Engineer using data analytics to improve Logistics operations and reduce costs.

For consulting or advice on analytics and sustainable Supply Chain transformation, feel free to contact me via Logigreen Consulting.

Samir Saci | Data Science & Productivity
A technical blog focusing on Data Science, Personal Productivity, Automation, Operations Research and Sustainable…samirsaci.com

The post Automate Supply Chain Analytics Workflows with AI Agents using n8n appeared first on Towards Data Science.

Uncertainty Quantification in Machine Learning with an Easy Python Interface

Archit Datar — Wed, 26 Mar 2025 19:14:47 +0000

Uncertainty quantification (UQ) in a Machine Learning (ML) model allows one to estimate the precision of its predictions. This is extremely important for utilizing its predictions in real-world tasks. For instance, if a machine learning model is trained to predict a property of a material, a predicted value with a 20% uncertainty (error) is likely to be used very differently from a predicted value with a 5% uncertainty (error) in the overall decision-making process. Despite its importance, UQ capabilities aren’t available with popular ML software in Python, such as scikit-learn, Tensorflow, and Pytorch.

Enter ML Uncertainty: a Python package designed to address this problem. Built on top of popular Python libraries such as SciPy and scikit-learn, ML Uncertainty provides a very intuitive interface to estimate uncertainties in ML predictions and, where possible, model parameters. Requiring only about four lines of code to perform these estimations, the package leverages powerful and theoretically rigorous mathematical methods in the background. It exploits the underlying statistical properties of the ML model in question, making the package computationally inexpensive. Moreover, this approach extends its applicability to real-world use cases where often, only small amounts of data are available.

Motivation

I have been an avid Python user for the last 10 years. I love the large number of powerful libraries that have been created and maintained, and the community, which is very active. The idea for ML Uncertainty came to me when I was working on a hybrid ML problem. I had built an ML model to predict stress-strain curves of some polymers. Stress-strain curves–an important property of polymers–obey certain physics-based rules; for instance, they have a linear region at low strain values, and the tensile modulus decreases with temperature.

I found from literature some non-linear models to describe the curves and these behaviors, thereby reducing the stress-strain curves to a set of parameters, each with some physical meaning. Then, I trained an ML model to predict these parameters from some easily measurable polymer attributes. Notably, I only had a few hundred data points, as is quite common in scientific applications. Having trained the model, finetuned the hyperparameters, and performed the outlier analysis, one of the stakeholders asked me: “This is all good, but what are the error estimates on your predictions?” And I realized that there wasn’t an elegant way to estimate this with Python. I also realized that this wasn’t going to be the last time that this problem was going to arise. And that led me down the path that culminated in this package.

Having spent some time studying Statistics, I suspected that the math for this wasn’t impossible or even that hard. I began researching and reading up books like Introduction to Statistical Learning and Elements of Statistical Learning^1,2 and found some answers there. ML Uncertainty is my attempt at implementing some of those methods in Python to integrate statistics more tightly into machine learning. I believe that the future of machine learning depends on our ability to increase the reliability of predictions and the interpretability of models, and this is a small step towards that goal. Having developed this package, I have frequently used it in my work, and it has benefited me greatly.

This is an introduction to ML Uncertainty with an overview of the theories underpinning it. I have included some equations to explain the theory, but if those are overwhelming, feel free to gloss over them. For every equation, I have stated the key idea it represents.

Getting started: An example

We often learn best by doing. So, before diving deeper, let’s consider an example. Say we are working on a good old-fashioned linear regression problem where the model is trained with scikit-learn. We think that the model has been trained well, but we want more information. For instance, what are the prediction intervals for the outputs? With ML Uncertainty, this can be done in 4 lines as shown below and discussed in this example.

Illustrating ML uncertainty code (a) and plot (b) for linear regression. Image by author.

All examples for this package can be found here: https://github.com/architdatar/ml_uncertainty/tree/main/examples.

Delving deeper: A peek under the hood

ML Uncertainty performs these computations by having the ParametricModelInference class wrap around the LinearRegression estimator from scikit-learn to extract all the information it needs to perform the uncertainty calculations. It follows the standard procedure for uncertainty estimation, which is detailed in many a statistics textbook,² of which an overview is shown below.

Since this is a linear model that can be expressed in terms of parameters (\( \beta \)) as \( y = X\beta \), ML Uncertainty first computes the degrees of freedom for the model (\( p \)), the error degrees of freedom (\( n – p – 1 \)), and the residual sum of squares (\( \hat{\sigma}^2 \)). Then, it computes the uncertainty in the model parameters; i.e., the variance-covariance matrix.³

\( \text{Var}(\hat{\beta}) = \hat{\sigma}^2 (J^T J)^{-1} \)

Where \( J \) is the Jacobian matrix for the parameters. For linear regression, this translates to:

\( \text{Var}(\hat{\beta}) = \hat{\sigma}^2 (X^T X)^{-1} \)

Finally, the get_intervals function computes the prediction intervals by propagating the uncertainties in both inputs as well as the parameters. Thus, for data \( X^* \) where predictions and uncertainties are to be estimated, predictions \( \hat{y^*} \) along with the \( (1 – \alpha) \times 100\% \) prediction interval are:

\( \hat{y^*} \pm t_{1 – \alpha/2, n – p – 1} \, \hat{\sigma} \sqrt{\text{Var}(\hat{y^*})} \)

Where,

\( \text{Var}(\hat{y^*}) = (\nabla_X f)(\delta X^*)^2(\nabla_X f)^T + (\nabla_\beta f)(\delta \hat{\beta})^2(\nabla_\beta f)^T + \hat{\sigma}^2 \)

In English, this means that the uncertainty in the output depends on the uncertainty in the inputs, uncertainty in the parameters, and the residual uncertainty. Simplified for a multiple linear model and assuming no uncertainty in inputs, this translates to:

\( \text{Var}(\hat{y^*}) = \hat{\sigma}^2 \left(1 + X^* (X^T X)^{-1} X^{*T} \right) \)

Extensions to linear regression

So, this is what goes on under the hood when those four lines of code are executed for linear regression. But this isn’t all. ML Uncertainty comes equipped with two more powerful capabilities:

Regularization: ML Uncertainty supports L1, L2, and L1+L2 regularization. Combined with linear regression, this means that it can cater to LASSO, ridge, and elastic net regressions. Check out this example.
Weighted least squares regression: Sometimes, not all observations are equal. We might want to give more weight to some observations and less weight to others. Commonly, this happens in science when some observations have a high amount of uncertainty while some are more precise. We want our regression to reflect the more precise ones, but cannot fully discard the ones with high uncertainty. For such cases, the weighted least squares regression is used.

Most importantly, a key assumption of linear regression is something known as homoscedasticity; i.e., that the samples of the response variables are drawn from populations with similar variances. If this is not the case, it is handled by assigning weights to responses depending on the inverse of their variance. This can be easily handled in ML Uncertainty by simply specifying the sample weights to be used during training in the y_train_weights parameter of the ParametricModelInference class, and the rest will be handled. An application of this is shown in this example, albeit for a nonlinear regression case.

Basis expansions

I am always fascinated by how much ML we can get done by just doing linear regression properly. Many kinds of data such as trends, time series, audio, and images, can be represented by basis expansions. These representations behave like linear models with many amazing properties. ML Uncertainty can be used to compute uncertainties for these models easily. Check out these examples called spline_synthetic_data, spline_wage_data, and fourier_basis.

Results of ML Uncertainty used for weighted least squares regression, B-Spline basis with synthetic data, B-Spline basis with wage data, and Fourier basis. Image by author.

Beyond linear regression

We often encounter situations where the underlying model cannot be expressed as a linear model. This commonly occurs in science, for instance, when complex reaction kinetics, transport phenomena, process control problems, are modeled. Standard Python packages like scikit-learn, etc., don’t allow one to directly fit these non-linear models and perform uncertainty estimation on them. ML Uncertainty ships with a class called NonLinearRegression capable of handling non-linear models. The user can specify the model to be fit and the class handles fitting with a scikit-learn-like interface which uses a SciPy least_squares function in the background. This can be easily integrated with the ParametericModelInference class for seamless uncertainty estimation. Like linear regression, we can handle weighted least squares and regularization for non-linear regression. Here is an example.

Random Forests

Random Forests have gained significant popularity in the field. They operate by averaging the predictions of decision trees. Decision trees, in turn, identify a set of rules to divide the predictor variable space (input space) and assign a response value to each terminal node (leaf). The predictions from decision trees are averaged to provide a prediction for the random forest.¹ They are particularly useful because they can identify complex relationships in data, are accurate, and make fewer assumptions about the data than regressions do.

While it is implemented in popular ML libraries like scikit-learn, there is no straightforward way to estimate prediction intervals. This is particularly important for regression as random forests, given their high flexibility, tend to overfit their training data. Since random forests doesn’t have parameters like traditional regression models do, uncertainty quantification needs to be performed differently.

We use the basic idea of estimating prediction intervals using bootstrapping as described by Hastie et al. in Chapter 7 of their book Elements of Statistical Learning.² The central idea we can exploit is that the variance of the predictions \( S(Z) \) for some data \( Z \) can be estimated via predictions of its bootstrap samples as follows:

\( \widehat{\text{Var}}[S(Z)] = \frac{1}{B – 1} \sum_{b=1}^{B} \left( S(Z^{*b}) – \bar{S}^{*} \right)^2 \)

Where \( \bar{S}^{*} = \sum_b S(Z^{*b}) / B \). Bootstrap samples are samples drawn from the original dataset repeatedly and independently, thereby allowing repetitions. Lucky for us, random forests are trained using one bootstrap sample for each decision tree within it. So, the prediction from each tree results in a distribution whose variance gives us the variance of the prediction. But there is still one problem. Let’s say we want to obtain the variance in prediction for the \( i^{\text{th}} \) training sample. If we simply use the formula above, some predictions will be from trees that include the \( i^{\text{th}} \) sample in the bootstrap sample on which they are trained. This could lead to an unrealistically smaller variance estimate.

To solve this problem, the algorithm implemented in ML Uncertainty only considers predictions from trees which did not use the \( i^{\text{th}} \) sample for training. This results in an unbiased estimate of the variance.

The beautiful thing about this approach is that we don’t need any additional re-training steps. Instead, the EnsembleModelInference class elegantly wraps around the RandomForestRegressor estimator in scikit-learn and obtains all the necessary information from it.

This method is benchmarked using the method described in Zhang et al.,⁴ which states that a correct \( (1 – \alpha) \times 100\% \) prediction interval is one for which the probability of it containing the observed response is \( (1 – \alpha) \times 100\% \). Mathematically,

\( P(Y \in I_{\alpha}) \approx 1 – \alpha \)

Here is an example to see ML Uncertainty in action for random forest models.

Uncertainty propagation (Error propagation)

How much does a certain amount of uncertainty in input variables and/or model parameters affect the uncertainty in the response variable? How does this uncertainty (epistemic) compare to the inherent uncertainty in the response variables (aleatoric uncertainty)? Often, it is important to answer these questions to decide on the course of action. For instance, if one finds that the uncertainty in model parameters contributes highly to the uncertainty in predictions, one could collect more data or investigate alternative models to reduce this uncertainty. Conversely, if the epistemic uncertainty is smaller than the aleatoric uncertainty, trying to reduce it further might be pointless. With ML uncertainty, these questions can be answered easily.

Given a model relating the predictor variables to the response variable, the ErrorPropagation class can easily compute the uncertainty in responses. Say the responses (\( y \)) are related to the predictor variables (\( X \)) via some function (\( f \)) and some parameters (\( \beta \)), expressed as:

\( y = f(X, \beta) \).

We wish to obtain prediction intervals for responses (\( \hat{y^*} \)) for some predictor data (\( X^* \)) with model parameters estimated as \( \hat{\beta} \). The uncertainty in \( X^* \) and \( \hat{\beta} \) are given by \( \delta X^* \) and \( \delta \hat{\beta} \), respectively. Then, the \( (1 – \alpha) \times 100\% \) prediction interval of the response variables will be given as:

\( \hat{y^*} \pm t_{1 – \alpha/2, n – p – 1} \, \hat{\sigma} \sqrt{\text{Var}(\hat{y^*})} \)

Where,

\( \text{Var}(\hat{y^*}) = (\nabla_X f)(\delta X^*)^2(\nabla_X f)^T + (\nabla_\beta f)(\delta \hat{\beta})^2(\nabla_\beta f)^T + \hat{\sigma}^2 \)

The important thing here is to notice how the uncertainty in predictions includes contributions from the inputs, parameters, as well as the inherent uncertainty of the response.

The ability of the ML Uncertainty package to propagate both input and parameter uncertainties makes it very handy, particularly in science, where we strongly care about the error (uncertainty) in each value being predicted. Consider the often talked about concept of hybrid machine learning. Here, we model known relationships in data through first principles and unknown ones using black-box models. Using ML Uncertainty, the uncertainties obtained from these different methods can be easily propagated through the computation graph.

A very simple example is that of the Arrhenius model for predicting reaction rate constants. The formula \( k = Ae^{-E_a / RT} \) is very well-known. Say, the parameters \( A, E_a \) were predicted from some ML model and have an uncertainty of 5%. We wish to know how much error that translates to in the reaction rate constant.

This can be very easily accomplished with ML Uncertainty as shown in this example.

Illustration of uncertainty propagation through computational graph. Image by author.

Limitations

As of v0.1.1, ML Uncertainty only works for ML models trained with scikit-learn. It supports the following ML models natively: random forest, linear regression, LASSO regression, ridge regression, elastic net, and regression splines. For any other models, the user can create the model, the residual, loss function, etc., as shown for the non-linear regression example. The package has not been tested for neural networks, transformers, and other deep learning models.

Contributions from the open-source ML community are welcome and highly appreciated. While there is much to be done, some key areas of effort are adapting ML Uncertainty to other frameworks such as PyTorch and Tensorflow, adding support for other ML models, highlighting issues, and improving documentation.

Benchmarking

The ML Uncertainty code has been benchmarked against the statsmodels package in Python. Specific cases can be found here.

Background

Uncertainty quantification in machine learning has been studied in the ML community and there is growing interest in this field. However, as of now, the existing solutions are applicable to very specific use cases and have key limitations.

For linear models, the statsmodels library can provide UQ capabilities. While theoretically rigorous, it cannot handle non-linear models. Moreover, the model needs to be expressed in a format specific to the package. This means that the user cannot take advantage of the powerful preprocessing, training, visualization, and other capabilities provided by ML packages like scikit-learn. While it can provide confidence intervals based on uncertainty in the model parameters, it cannot propagate uncertainty in predictor variables (input variables).

Another family of solutions is model-agnostic UQ. These solutions utilize subsamples of training data, train the model repeatedly based on it, and use these results to estimate prediction intervals. While sometimes useful in the limit of large data, these techniques may not provide accurate estimates for small training datasets where the samples chosen might lead to substantially different estimates. Moreover, it is a computationally expensive exercise since the model needs to be retrained multiple times. Some packages using this approach are MAPIE, PUNCC, UQPy, and ml_uncertainty by NIST (same name, different package), among many others.^5–8

With ML Uncertainty, the goals have been to keep the training of the model and its UQ separate, cater to more generic models beyond linear regression, exploit the underlying statistics of the models, and avoid retraining the model multiple times to make it computationally inexpensive.

Summary and future work

This was an introduction to ML Uncertainty—a Python software package to easily compute uncertainties in machine learning. The main features of this package have been introduced here and some of the philosophy of its development has been discussed. More detailed documentation and theory can be found in the docs. While this is only a start, there is immense scope to expand this. Questions, discussions, and contributions are always welcome. The code can be found on GitHub and the package can be installed from PyPi. Give it a try with pip install ml-uncertainty.

References

(1) James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer US: New York, NY, 2021. https://doi.org/10.1007/978-1-0716-1418-1.

(2) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer New York: New York, NY, 2009. https://doi.org/10.1007/978-0-387-84858-7.

(3) Börlin, N. Nonlinear Optimization. https://www8.cs.umu.se/kurser/5DA001/HT07/lectures/lsq-handouts.pdf.

(4) Zhang, H.; Zimmerman, J.; Nettleton, D.; Nordman, D. J. Random Forest Prediction Intervals. Am Stat 2020, 74 (4), 392–406. https://doi.org/10.1080/00031305.2019.1585288.

(5) Cordier, T.; Blot, V.; Lacombe, L.; Morzadec, T.; Capitaine, A.; Brunel, N. Flexible and Systematic Uncertainty Estimation with Conformal Prediction via the MAPIE Library. In Conformal and Probabilistic Prediction with Applications; 2023.

(6) Mendil, M.; Mossina, L.; Vigouroux, D. PUNCC: A Python Library for Predictive Uncertainty and Conformalization. In Proceedings of the Twelfth Symposium on Conformal and Probabilistic Prediction with Applications; Papadopoulos, H., Nguyen, K. A., Boström, H., Carlsson, L., Eds.; Proceedings of Machine Learning Research; PMLR, 2023; Vol. 204, pp 582–601.

(7) Tsapetis, D.; Shields, M. D.; Giovanis, D. G.; Olivier, A.; Novak, L.; Chakroborty, P.; Sharma, H.; Chauhan, M.; Kontolati, K.; Vandanapu, L.; Loukrezis, D.; Gardner, M. UQpy v4.1: Uncertainty Quantification with Python. SoftwareX 2023, 24, 101561. https://doi.org/10.1016/j.softx.2023.101561.

(8) Sheen, D. Machine Learning Uncertainty Estimation Toolbox. https://github.com/usnistgov/ml_uncertainty_py.

\[\]

The post Uncertainty Quantification in Machine Learning with an Easy Python Interface appeared first on Towards Data Science.

Data Science | Towards Data Science

The Basis of Cognitive Complexity: Teaching CNNs to See Connections

Is it the same? Is it different?

Can CNNs learn same-different relationships?

Conclusions

Another thing!

Reference

Why CatBoost Works So Well: The Engineering Behind the Magic

Target Statistic

Greedy Target Statistic

Leave One Out Target Statistic

Ordered Target Statistic

Ordered Boosting

Building a Tree

Split Candidates

Oblivious Trees

Conclusion

Further Reading

A Data Scientist’s Guide to Docker Containers

What is a container?

What is Docker?

What do we need to create a Docker container?

Using Docker: An example

1. Build a model

2. Create requirements

3. Create Dockerfile

4. Create Docker Image

5. Run Docker Container

Conclusion

How I Would Learn To Code (If I Could Start Over)

My journey

Choose a language

Learn the bare minimum

Avoid trends

Document your learning

What about AI?

Another thing!

Connect with me

Creating an AI Agent to Write Blog Posts with CrewAI

Introduction

What is a Crew?

Code

Folder Structure

Agents

Tasks

Coding the Crew

Agents Functions

Tasks Functions

Crew

Running the Crew

Before You Go

GitHub Repository

About Me

References

PyScript vs. JavaScript: A Battle of Web Titans

Round 1: What Are They?

Round 2: Performance Battle

JavaScript

PyScript

Round 3: Ease of Use & Readability

Round 4: Ecosystem & Libraries

Final Verdict

The Art of Hybrid Architectures

1. The Strengths and Limitations of CNNs and Transformers

1.1 The Strength of CNNs: Great with Details, Limited in Scope

1.2 The Strength of Transformers: Global Awareness, But Less Precise

1.3 Why a Hybrid Architecture Is Necessary

1.4 The Advantages of a Hybrid Architecture

2. Why I Chose ConvNextV2: Key Innovations Behind the Backbone

2.1 FCMAE Self-Supervised Learning: Adaptive Learning Inspired by the Human Brain

2.2 GRN Global Calibration: Mimicking an Expert’s Attention

2.3 Sparse and Efficient Convolutions: More with Less

2.4 Why ConvNextV2 Was the Right Fit for a Hybrid Architecture

3. Technical Implementation of the MultiHeadAttention Mechanism

3.2 Application of Einstein Summation Convention in Attention Calculation

3.3 Implementation Code Analysis

3.4 How Attention Mechanisms Enhance Understanding of Morphological Feature Relationships

3.4.1. Feature Relationship Modeling

3.4.2. Dynamic Feature Importance Assessment

3.4.3. Complementary Information Integration