Convolution Network | Towards Data Science
https://towardsdatascience.com/tag/convolution-network/

The Basis of Cognitive Complexity: Teaching CNNs to See Connections
https://towardsdatascience.com/the-basis-of-cognitive-complexity-teaching-cnns-to-see-connections/
Transforming CNNs: From task-specific learning to abstract generalization

Liberating education consists in acts of cognition, not transferrals of information.

Paulo Freire

One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing?

Many authors suggest that artificial intelligence models do not possess the same capabilities as humans, especially when it comes to plasticity, flexibility, and adaptation.

One aspect these models fail to capture is the set of causal relationships that govern the external world.

This article discusses these issues:

  • The parallelism between convolutional neural networks (CNNs) and the human visual cortex
  • Limitations of CNNs in understanding causal relations and learning abstract concepts
  • How to make CNNs learn simple causal relations

Is it the same? Is it different?

Convolutional networks (CNNs) [2] are multi-layered neural networks that take images as input and can be used for multiple tasks. One of the most fascinating aspects of CNNs is their inspiration from the human visual cortex [1]:

  • Hierarchical processing. The visual cortex processes images hierarchically: early visual areas capture simple features (such as edges, lines, and colors), while deeper areas capture more complex features such as shapes, objects, and scenes. CNNs, thanks to their layered structure, likewise capture edges and textures in their early layers, while deeper layers capture object parts or whole objects.
  • Receptive fields. Neurons in the visual cortex respond to stimuli in a specific local region of the visual field (commonly called receptive fields). As we go deeper, the receptive fields of the neurons widen, allowing more spatial information to be integrated. Thanks to pooling steps, the same happens in CNNs.
  • Feature sharing. Although biological neurons are not identical, similar features are recognized across different parts of the visual field. In CNNs, the various filters scan the entire image, allowing patterns to be recognized regardless of location.
  • Spatial invariance. Humans can recognize objects even when they are moved, scaled, or rotated. CNNs also possess this property.
The relationship between components of the visual system and CNN. Image source: here

These features have made CNNs perform well in visual tasks to the point of superhuman performance:

Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well-trained on the validation images to be better aware of the existence of relevant classes. […] Our result (4.94%) exceeds the reported human-level performance. —source [3]

Although CNNs perform better than humans in several tasks, there are still cases where they fail spectacularly. For example, a 2024 study [4] showed that AI models fail to generalize in image classification: state-of-the-art models outperform humans on objects in upright poses but fail when objects appear in unusual poses.

The correct label is shown above each object, and the model's incorrect prediction is shown below. Image source: here

In conclusion, our results show that (1) humans are still much more robust than most networks at recognizing objects in unusual poses, (2) time is of the essence for such ability to emerge, and (3) even time-limited humans are dissimilar to deep neural networks. —source [4]

The authors of [4] note that humans need time to succeed at such tasks: some tasks require not only visual recognition but also abstract reasoning, which takes time.

The generalization abilities of humans come from understanding the laws that govern relations among objects. Humans recognize objects by extrapolating rules and chaining these rules to adapt to new situations. One of the simplest rules is the "same-different relation": the ability to judge whether two objects are the same or different. This ability develops rapidly during infancy and is closely associated with language development [5-7]. Some animals, such as ducks and chimpanzees, possess it as well [8]. In contrast, learning same-different relations is very difficult for neural networks [9-10].

Example of a same-different task for a CNN. The network should return a label of 1 if the two objects are the same or a label of 0 if they are different. Image source: here
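As a minimal illustration of this setup, here is a hypothetical shallow CNN that maps an image containing two objects to a same/different prediction. The architecture is for illustration only and is not the exact network used in [9-11]:

import torch
import torch.nn as nn

class SameDifferentCNN(nn.Module):
    # Binary classifier for same-different images. Input: one image containing two objects.
    def __init__(self, channels=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 2),          # logits for "different" (0) vs "same" (1)
        )

    def forward(self, x):                    # x: (batch, 1, H, W)
        return self.head(self.features(x))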

Convolutional networks struggle to learn this relationship; likewise, they fail to learn other types of causal relationships that are simple for humans. Therefore, many researchers have concluded that CNNs lack the inductive bias necessary to learn such relationships.

These negative results do not mean that neural networks are completely incapable of learning same-different relations. Much larger models trained for longer can learn this relation. For example, vision-transformer models pre-trained on ImageNet with contrastive learning can show this ability [12].

Can CNNs learn same-different relationships?

The fact that large models can learn these kinds of relationships has rekindled interest in CNNs. The same-different relationship is considered one of the basic logical operations that form the foundation of higher-order cognition and reasoning. Showing that shallow CNNs can learn this concept would allow us to experiment with other relationships. Moreover, it would allow models to learn increasingly complex causal relationships. This is an important step in advancing the generalization capabilities of AI.

Previous work suggests that CNNs lack the architectural inductive biases needed to learn abstract visual relations. Other authors assume that the problem lies in the training paradigm. In general, classical gradient descent is used to learn a single task or a set of tasks: given a task t (or a set of tasks T), a loss function L is used to optimize the weights φ that minimize L:

Image source from here

This can be viewed as simply the sum of the losses across different tasks (if we have more than one task). Instead, the Model-Agnostic Meta-Learning (MAML) algorithm [13] is designed to search for an optimal point in weight space for a set of related tasks. MAML seeks to find an initial set of weights θ that minimizes the loss function across tasks, facilitating rapid adaptation:

Image source from here
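In symbols, the two objectives can be written explicitly (this is the standard formulation from the MAML paper [13], with α denoting the inner-loop learning rate; the figures above may use slightly different notation):

$$\phi^{*} = \arg\min_{\phi} \sum_{t \in T} L_{t}(\phi) \qquad \text{(standard multi-task training)}$$

$$\theta^{*} = \arg\min_{\theta} \sum_{t \in T} L_{t}\big(\theta - \alpha \nabla_{\theta} L_{t}(\theta)\big) \qquad \text{(MAML)}$$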

The difference may seem small, but conceptually, this approach is directed toward abstraction and generalization. If there are multiple tasks, traditional training tries to optimize weights for different tasks. MAML tries to identify a set of weights that is optimal for different tasks but at the same time equidistant in the weight space. This starting point θ allows the model to generalize more effectively across different tasks.

Meta-learning initial weights for generalization. Image source from here
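To make the two-level optimization concrete, here is a minimal sketch of one meta-update using the first-order approximation of MAML. This is a simplification of the full algorithm in [13]; the function name, task format, and hyperparameters are illustrative:

import copy
import torch

def fomaml_step(model, tasks, loss_fn, inner_lr=0.01, meta_lr=0.001, inner_steps=1):
    # One meta-update with first-order MAML. `tasks` is a list of
    # ((x_support, y_support), (x_query, y_query)) batches, one pair per task.
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for (x_s, y_s), (x_q, y_q) in tasks:
        fast = copy.deepcopy(model)                      # task-specific copy of the current weights
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                     # inner loop: adapt to the task
            inner_opt.zero_grad()
            loss_fn(fast(x_s), y_s).backward()
            inner_opt.step()
        fast.zero_grad()
        loss_fn(fast(x_q), y_q).backward()               # evaluate the adapted weights on held-out data
        for g, p in zip(meta_grads, fast.parameters()):
            g += p.grad                                  # first-order approximation of the meta-gradient
    with torch.no_grad():                                # outer loop: update the shared initialization
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g / len(tasks)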

Since we now have a method biased toward generalization and abstraction, we can test whether we can make CNNs learn the same-different relationship.

In this study [11], the authors compared shallow CNNs trained with classic gradient descent against the same networks trained with meta-learning, using a dataset designed for this purpose. The dataset consists of 10 different tasks that test for the same-different relationship.

The Same-Different dataset. Image source from here

The authors [11] compare CNNs of 2, 4, or 6 layers trained in a traditional way or with meta-learning, showing several interesting results:

  1. The performance of traditionally trained CNNs is similar to random guessing.
  2. Meta-learning significantly improves performance, suggesting that the model can learn the same-different relationship. A 2-layer CNN performs only slightly better than chance, but increasing the depth of the network improves performance to near-perfect accuracy.
Comparison between traditional training and meta-learning for CNNs. Image source from here

One of the most intriguing results of [11] is that the model can be trained in a leave-one-out way (trained on 9 tasks with one held out) and still show out-of-distribution generalization. Thus, the model has learned an abstract behavior that is rarely seen in such a small model (6 layers).

Out-of-distribution generalization for same-different classification. Image source from here

Conclusions

Although convolutional networks were inspired by how the human brain processes visual stimuli, they do not capture some of its basic capabilities. This is especially true when it comes to causal relations or abstract concepts. Some of these relationships can be learned only by large models after extensive training. This has led to the assumption that small CNNs cannot learn these relations due to a lack of architectural inductive bias. In recent years, efforts have been made to create new architectures that could have an advantage in learning relational reasoning. Yet most of these architectures fail to learn these kinds of relationships. Intriguingly, this limitation can be overcome through the use of meta-learning.

The advantage of meta-learning is that it incentivizes more abstract learning. Meta-learning pressures the model toward generalization by trying to optimize for all tasks at the same time. To do this, learning more abstract features is favored (low-level features, such as the angles of a particular shape, are not useful for generalization and are disfavored). Meta-learning thus allows a shallow CNN to learn abstract behavior that would otherwise require many more parameters and much more training.

Shallow CNNs and the same-different relationship serve as a model system for higher cognitive functions. Meta-learning and other forms of training could be useful for improving the reasoning capabilities of models.

Another thing!

You can look for my other articles on Medium, and you can also connect with me or reach me on LinkedIn or Bluesky. Check this repository, which contains weekly updated ML & AI news, or here for other tutorials and here for AI reviews. I am open to collaborations and projects, and you can reach me on LinkedIn.

References

Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.

  1. Lindsay, 2020, Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future, link
  2. Li, 2020, A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects, link
  3. He, 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, link
  4. Ollikka, 2024, A comparison between humans and AI at recognizing objects in unusual poses, link
  5. Premack, 1981, The codes of man and beasts, link
  6. Blote, 1999, Young children’s organizational strategies on a same–different task: A microgenetic study and a training study, link
  7. Lupker, 2015, Is there phonologically based priming in the same-different task? Evidence from Japanese-English bilinguals, link
  8. Gentner, 2021, Learning same and different relations: cross-species comparisons, link
  9. Kim, 2018, Not-so-clevr: learning same–different relations strains feedforward neural networks, link
  10. Puebla, 2021, Can deep convolutional neural networks support relational reasoning in the same-different task? link
  11. Gupta, 2025, Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation, link
  12. Tartaglini, 2023, Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations, link
  13. Finn, 2017, Model-agnostic meta-learning for fast adaptation of deep networks, link

How we made EfficientNet more efficient
https://towardsdatascience.com/how-we-made-efficientnet-more-efficient-61e1bf3f84b3/
Better EfficientNet performance in practice

Thoughts and Theory
Image by author.

In our new paper "Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training", we take the state-of-the-art model EfficientNet [1], which was optimised to be – theoretically – efficient, and look at three ways to make it more efficient in practice on IPUs.

For example, adding group convolutions, which have been shown to perform well on IPUs, achieved up to a 3x improvement in practical training throughput with minimal difference in the theoretical compute cost.

Combining all three methods investigated, we achieve up to a 7x improvement in training throughput and 3.6x improvement on inference on IPUs, for comparable validation accuracy.

The theoretical cost of model training, typically measured in FLOPs, is easy to calculate and agnostic to the hardware and software stack being used. These characteristics make it an appealing complexity measure that has become a key driver in the search for more efficient deep learning models.

In reality, however, there has been a significant disparity between this theoretical measure of the training cost and the cost in practice. This is because a simple FLOP count does not take into account many other important factors, such as the structure of the compute and data movement.

Introducing Group Convolutions

The first method we investigate is how to improve the performance associated with depthwise convolutions (in other words, group convolutions with group size 1). EfficientNet natively uses depthwise convolutions for all spatial convolution operations. They are well known for being FLOP and parameter efficient and so have been successfully utilised in many state-of-the-art convolutional neural networks (CNNs). However, they present several challenges for acceleration in practice.
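For reference, a depthwise convolution is expressed in PyTorch through the groups argument (group size 1 means groups equal to the number of channels); the snippet below, with an arbitrary channel count, shows where the parameter saving comes from:

import torch.nn as nn

channels = 96
# Depthwise convolution: one 3x3 filter per channel (group size 1, i.e. groups == channels)
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
# Dense convolution for comparison: every output channel sees every input channel
dense = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
print(sum(p.numel() for p in depthwise.parameters()))  # 960 parameters (96 * 3 * 3 weights + 96 biases)
print(sum(p.numel() for p in dense.parameters()))      # 83,040 parameters (96 * 96 * 3 * 3 weights + 96 biases)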

The first challenge is that each spatial kernel is considered in isolation, so the length of the resulting dot-product operations, which are typically accelerated by vector multiply-accumulate hardware, is limited. This means the hardware cannot always be fully utilised, resulting in "wasted" cycles.

Depthwise convolutions also have very low arithmetic intensity as they require a significant amount of data transfer relative to the number of FLOPs performed, meaning memory access speed is an important factor. While this can limit throughput on alternative hardware, the IPU’s In-Processor Memory architecture delivers high-bandwidth memory access which can significantly improve performance for low arithmetic intensity operations like these.

Finally, depthwise convolutions have been found to be most effective when they are sandwiched between two dense pointwise "projection" convolutions to form an MBConv block. These pointwise convolutions increase and decrease the dimensionality of the activations by an "expansion factor" of 6 around the spatial depthwise convolution. While this expansion leads to good task performance, it also creates very large activation tensors, which can dominate memory requirements and, ultimately, limit the maximum batch size that can be used.

To address these three issues, we make a simple but significant alteration to the MBConv block. We increase the size of the convolution groups from 1 to 16; this leads to better IPU hardware utilisation. Then, to compensate for the increase in FLOPs and parameters, and address the memory issues, we reduce the expansion ratio to 4. This leads to a more memory efficient and computationally compact version of EfficientNet that we refer to as G16-EfficientNet.
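A minimal sketch of the modified block, following the description above (pointwise expansion, grouped spatial convolution, pointwise projection). Squeeze-and-excite, the skip connection and the exact normalisation layers are omitted, and the helper name and Batch Norm placeholder are our own simplifications:

import torch.nn as nn

def g16_mbconv(c_in, c_out, expansion=4, group_size=16, kernel_size=3):
    # Expansion ratio reduced from 6 to 4; the spatial convolution uses groups of 16 channels.
    c_mid = c_in * expansion
    assert c_mid % group_size == 0
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.SiLU(),   # pointwise expansion
        nn.Conv2d(c_mid, c_mid, kernel_size, padding=kernel_size // 2,
                  groups=c_mid // group_size, bias=False),                          # grouped spatial convolution
        nn.BatchNorm2d(c_mid), nn.SiLU(),
        nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),              # pointwise projection
    )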

While these alterations were primarily motivated by throughput improvements, we also found that they enabled us to achieve higher ImageNet validation accuracy than the vanilla group size 1 (G1-EfficientNet) baseline across all the model sizes. This modification leads to significant improvements in practical efficiency.

Comparison of theoretical (left) and practical (right) efficiency of G1-EfficientNet (baseline) vs G16 variant (ours). Images by author.

Proxy Normalised Activations

Normalising the outputs of convolution and matrix multiply operations has become an essential element of modern CNNs, with Batch Normalisation the most common method used for this purpose. However, the constraints on batch size introduced by Batch Norm are a well-known issue that has sparked a string of innovations in batch-independent alternatives. While many of these methods work well with ResNet models, we found that none of them achieve the same performance as Batch Norm for EfficientNet.

To address this lack of an alternative to Batch Norm, we leverage the novel batch-independent normalisation method Proxy Norm, introduced in a recent paper [2]. This method builds on the already successful methods of Group (and Layer) Normalisation.

Group Norm and Layer Norm suffer from an issue where the activations can become channelwise denormalised. This issue becomes worse with depth since the denormalisation gets accentuated at every layer. While this issue could be avoided by simply reducing the size of groups in Group Norm, such a reduction in the size of groups would, however, alter the expressivity and penalise performance.

Proxy Norm provides a better fix by preserving expressivity while counteracting the two main sources of denormalisation: the affine transformation and the activation function that follow Group Norm or Layer Norm. Concretely, the denormalisation is counteracted by assimilating the outputs of Group Norm or Layer Norm to a Gaussian "proxy" variable and by applying the same affine transformation and the same activation function to this proxy variable. The statistics of the denormalised proxy variable are then used to correct the expected distributional shift in the real activations.

Proxy Norm allows us to maximise the group size (i.e. use Layer Norm) and to preserve expressivity without the issue of channel-wise denormalisation.
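Below is a rough sketch of the idea using a simple sample-based Gaussian proxy; the actual implementation in [2] differs in its details, and the class name, sampling scheme and activation choice here are our own assumptions:

import torch
import torch.nn as nn

class ProxyNormActivation(nn.Module):
    # Applied to the output of a batch-independent norm (e.g. Layer Norm), shape (N, C, H, W).
    def __init__(self, num_channels, num_samples=256, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))     # affine scale
        self.bias = nn.Parameter(torch.zeros(num_channels))      # affine shift
        self.eps = eps
        self.act = nn.SiLU()                                      # EfficientNet-style activation
        # Fixed samples from the Gaussian "proxy" variable, shared across spatial positions.
        self.register_buffer("proxy", torch.randn(num_samples, num_channels))

    def forward(self, x):
        w = self.weight.view(1, -1, 1, 1)
        b = self.bias.view(1, -1, 1, 1)
        y = self.act(x * w + b)                                   # real activations (affine + non-linearity)
        # Pass the proxy through the *same* affine transformation and activation function ...
        proxy_out = self.act(self.proxy * self.weight + self.bias)
        # ... and use its per-channel statistics to counteract the channel-wise denormalisation.
        mean = proxy_out.mean(dim=0).view(1, -1, 1, 1)
        std = proxy_out.std(dim=0).view(1, -1, 1, 1)
        return (y - mean) / (std + self.eps)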

Convolution block with additional Proxy-Normalised Activation operations shown in red. Image by author.

This novel normalisation technique is explored in detail in the associated paper [2].

Importantly, this overall approach does not emulate any of the implicit regularisation characteristics of Batch Norm. For this reason, additional regularisation is required – in this work we use a combination of mixup and cutmix. When comparing the performance of Layer Norm + Proxy Norm (LN+PN) to two Batch Norm (BN) baselines with standard pre-processing and AutoAugment (AA), we find that LN+PN matches or exceeds the performance of BN with standard pre-processing across the full range of model sizes. Furthermore, LN+PN is nearly as good as BN with AA, despite AA requiring an expensive process of "training" the augmentation parameters.

Comparison of different normalisation methods for varying sizes of EfficientNet. Image by author.

Reduced Resolution Training

Touvron et al. (2020) [3] showed that significant accuracy gains could be achieved via a post-training fine-tuning of the last few layers using larger images than originally trained on. As this fine-tuning stage is very cheap, it was clear that this would achieve some practical training efficiency benefits. This raised a number of further interesting research questions. How should the training resolution be chosen to maximise efficiency? Given that larger images are slower to test on, how does this impact efficiency at inference?

To investigate these questions, we compared training at two different resolutions, either the "native" resolution (as defined in the original EfficientNet work) or at approximately half the pixel count. We then fine-tuned and tested at a broad range of image sizes. This allowed us to investigate the direct effect of training resolution on efficiency and determine the Pareto optimal combinations that achieved the best speed-accuracy trade-offs for training and inference.
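For intuition, here is a rough sketch of what such a fine-tuning stage can look like in PyTorch. The model constructor, the choice of layers to unfreeze, and the crop sizes are illustrative assumptions, not the exact recipe used in this work:

import torch
import torchvision
from torchvision import transforms

model = torchvision.models.efficientnet_b0(weights=None)   # stand-in for the EfficientNet variant above

for p in model.parameters():                 # freeze the whole network ...
    p.requires_grad = False
for p in model.features[-1].parameters():    # ... except the last convolutional stage
    p.requires_grad = True
for p in model.classifier.parameters():      # ... and the classifier head
    p.requires_grad = True

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)

# Main training used smaller crops (e.g. roughly half the pixel count);
# fine-tuning switches to the native resolution.
finetune_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])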

When comparing training efficiency, we considered two testing scenarios: testing on the Native resolution or selecting the "best" resolution to maximise validation accuracy across the full sweep of resolutions.

When testing at the native resolution, we see that training with half-size images yields considerable theoretical and practical efficiency improvements. Remarkably, for a given model size, we find that training at half resolution and fine-tuning at the native resolution even yields higher final accuracy than training, fine-tuning and testing all at the native resolution. This conclusion suggests that, for ImageNet training, we should always be testing at a higher resolution than we train at. We now hope to understand if this applies to other domains too.

If we next allow ourselves to test at the "best" image resolution, we see that training at native resolution yields a significant improvement in final accuracy, narrowing the gap in the Pareto front.

It should, however, be noted that to achieve this, the "best" testing resolutions for the "native" training scheme end up being much larger than those that correspond to the half training resolution cases. This means they will be more expensive at inference time.

Images by author.

These results highlight the training-efficiency gains achieved by the three modifications investigated: (i) group convolutions [G16 (ours) vs G1]; (ii) proxy-normalised activations [LN+PN (ours) vs GN]; and (iii) half-resolution training [Half (ours) vs Native]. Note that the baseline results have no fine-tuning and use the native image resolution.

Comparing the efficiency of inference on its own, we see that training at half resolution yields Pareto-optimal efficiency across the full range of accuracies. This is a remarkable result as there is no direct FLOP advantage in inference at all. Furthermore, the points along the half-resolution inference-efficiency Pareto front remain optimal for training throughput.

Theoretical and practical inference efficiency. Tested at all resolutions; lines highlight Pareto fronts. Images by author.

Across all efficiency metrics, the models with Proxy Norm perform either equivalently to or slightly better than the models with Group Norm. This stems from the improved accuracy at only a small cost in throughput of ~10%. Importantly, however, models with Proxy Norm use fewer parameters across the whole Pareto front, highlighting an additional benefit of Proxy Norm in terms of efficiency with respect to model size.

How to make EfficientNet more efficient

In carrying out this research, we have looked at several modifications to the EfficientNet model to improve the overall efficiency in training and inference:

  • By adding group convolutions and reducing the expansion ratio in the MBConv blocks, we have improved IPU hardware utilisation of the spatial convolutions and reduced the memory consumption.
  • By training with images of half the resolution, we have cut training time and remarkably achieved better final accuracy.
  • By leveraging the novel normalisation method Proxy Norm, we matched Batch Norm performance without any dependency on the batch information. To our knowledge, this is the first method to achieve this for EfficientNet.

Using all these methods in combination, we have achieved up to a 7x improvement in practical training efficiency and 3.6x improvement in practical inference efficiency on IPU. These results show that EfficientNet can deliver training and inference efficiency when using hardware suited to processing group convolutions, like IPU, taking it beyond the theory and towards practical, real-world applications.

Read the paper

Thank you

Thank you to Antoine Labatie, Zach Eaton-Rosen and Carlo Luschi who also contributed to this research, and thank you to our other colleagues at Graphcore for their support and insights.

References

[1] M. Tan, Q. V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (2019), arXiv 2019

[2] A. Labatie, D. Masters, Z. Eaton-Rosen, C. Luschi, Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence (2021), arXiv 2021

[3] H. Touvron, A. Vedaldi, M. Douze, H. Jégou, Fixing the train-test resolution discrepancy: FixEfficientNet (2020), arXiv 2020

Convolutional Neural Networks, Explained
https://towardsdatascience.com/convolutional-neural-networks-explained-9cc5188c4939/
Let's build your first CNN model

Photo by Christopher Gower on Unsplash

A Convolutional Neural Network, also known as a CNN or ConvNet, is a class of neural networks that specializes in processing data with a grid-like topology, such as an image. A digital image is a binary representation of visual data: a series of pixels arranged in a grid-like fashion, with pixel values denoting how bright each pixel is and what color it should be.

Figure 1: Representation of image as a grid of pixels (Source)

The human brain processes a huge amount of information the second we see an image. Each neuron works in its own receptive field and is connected to other neurons in a way that together they cover the entire visual field. Just as each biological neuron responds to stimuli only in a restricted region of the visual field called its receptive field, each neuron in a CNN processes data only in its receptive field as well. The layers are arranged so that they detect simpler patterns first (lines, curves, etc.) and more complex patterns (faces, objects, etc.) further along. By using CNNs, one can give computers the ability to see.

Convolutional Neural Network Architecture

A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.

Figure 2: Architecture of a CNN (Source)

Convolution Layer

The convolution layer is the core building block of the CNN. It carries the main portion of the network’s computational load.

This layer performs a dot product between two matrices, where one matrix is the set of learnable parameters, otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field. The kernel is spatially smaller than the image but extends through its full depth. This means that, if the image is composed of three (RGB) channels, the kernel height and width will be spatially small, but its depth extends across all three channels.

Illustration of Convolution Operation (source)

During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region. This produces a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position of the image. The step size with which the kernel slides is called the stride.

If we have an input of size W x W x D, Dout kernels of spatial size F, stride S, and padding P, then the size of the output volume can be determined by the following formula:

Wout = (W - F + 2P) / S + 1

Formula for the convolution layer output size

This will yield an output volume of size Wout x Wout x Dout.
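As a quick sanity check of this formula in PyTorch (the numbers below are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # a single RGB image: W = 32, D = 3
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5, stride=1, padding=2)
print(conv(x).shape)            # torch.Size([1, 8, 32, 32]): (32 - 5 + 2*2)/1 + 1 = 32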

Figure 3: Convolution Operation (Source: Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville)

Motivation behind Convolution

Convolution leverages three important ideas that motivated Computer Vision researchers: sparse interaction, parameter sharing, and equivariant representation. Let’s describe each one of them in detail.

Traditional neural network layers use matrix multiplication by a matrix of parameters that describes the interaction between the input and output units, which means that every output unit interacts with every input unit. Convolutional neural networks, however, have sparse interactions. This is achieved by making the kernel smaller than the input: an image may have thousands or millions of pixels, but while processing it with a kernel we can detect meaningful information spanning only tens or hundreds of pixels. This means we need to store fewer parameters, which not only reduces the memory requirements of the model but also improves its statistical efficiency.

If computing a feature at one spatial point (x1, y1) is useful, then it should also be useful at some other spatial point, say (x2, y2). In other words, for a single two-dimensional slice, i.e., for creating one activation map, neurons are constrained to use the same set of weights. In a traditional neural network, each element of the weight matrix is used once and then never revisited, whereas a convolutional network has shared parameters: the weights applied to one part of the input are the same as those applied elsewhere.

Due to parameter sharing, the layers of a convolutional neural network have a property of equivariance to translation: if we shift the input, the output shifts in the same way.
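To make the first two ideas (sparse interaction and parameter sharing) concrete, compare the parameter counts of a fully connected layer and a convolutional layer operating on the same 28 x 28 single-channel image; the layer sizes are chosen purely for illustration:

import torch.nn as nn

dense = nn.Linear(28 * 28, 28 * 28)               # every output connected to every input
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # one shared 3x3 kernel slid over the whole image
print(sum(p.numel() for p in dense.parameters())) # 615,440 parameters
print(sum(p.numel() for p in conv.parameters()))  # 10 parameters (9 weights + 1 bias)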

Pooling Layer

The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps in reducing the spatial size of the representation, which decreases the required amount of computation and weights. The pooling operation is processed on every slice of the representation individually.

There are several pooling functions such as the average of the rectangular neighborhood, L2 norm of the rectangular neighborhood, and a weighted average based on the distance from the central pixel. However, the most popular process is max pooling, which reports the maximum output from the neighborhood.

Figure 4: Pooling Operation (Source: O'Reilly Media)

If we have an activation map of size W x W x D, a pooling kernel of spatial size F, and stride S, then the size of output volume can be determined by the following formula:

Wout = (W - F) / S + 1

Formula for the pooling layer output size

This will yield an output volume of size Wout x Wout x D.

In all cases, pooling provides some translation invariance, which means that an object would be recognizable regardless of where it appears in the frame.

Fully Connected Layer

Neurons in this layer have full connectivity with all neurons in the preceding and succeeding layers, as in a regular fully connected neural network. Their output can therefore be computed as usual: a matrix multiplication followed by a bias offset.

The FC layer helps to map the representation between the input and the output.

Non-Linearity Layers

Since convolution is a linear operation and images are far from linear, non-linearity layers are often placed directly after the convolutional layer to introduce non-linearity to the activation map.

There are several types of non-linear operations, the popular ones being:

1. Sigmoid

The sigmoid non-linearity has the mathematical form σ(κ) = 1/(1 + e^(−κ)). It takes a real-valued number and "squashes" it into a range between 0 and 1.

However, a very undesirable property of the sigmoid is that when the activation is at either tail, the gradient becomes almost zero. If the local gradient becomes very small, backpropagation will effectively "kill" the gradient. Also, if the data coming into the neuron is always positive, then the gradients on the weights during backpropagation will be either all positive or all negative, resulting in a zig-zag dynamic in the weight updates.

2. Tanh

Tanh squashes a real-valued number to the range [-1, 1]. Like sigmoid, the activation saturates, but – unlike the sigmoid neurons – its output is zero centered.

3. ReLU

The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function ƒ(κ) = max(0, κ). In other words, the activation is simply thresholded at zero.

In comparison to sigmoid and tanh, ReLU is more reliable and has been found to accelerate convergence by up to six times.

Unfortunately, ReLU can be fragile during training: a large gradient flowing through a ReLU neuron can update its weights in such a way that the neuron never activates again. This can usually be mitigated by setting a proper learning rate.

Designing a Convolutional Neural Network

Now that we understand the various components, we can build a convolutional neural network. We will be using Fashion-MNIST, which is a dataset of Zalando’s article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes. The dataset can be downloaded here.

Our convolutional neural network has the following architecture:

[INPUT]

→[CONV 1] → [BATCH NORM] → [ReLU] → [POOL 1]

→ [CONV 2] → [BATCH NORM] → [ReLU] → [POOL 2]

→ [FC LAYER] → [RESULT]

For both conv layers, we will use kernels of spatial size 5 x 5 with stride 1 and padding 2. For both pooling layers, we will use the max pool operation with kernel size 2, stride 2, and zero padding.
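Applying the output-size formulas from the previous sections to the 28 x 28 x 1 input, the layer dimensions referenced in the calculation figures below work out as follows:

  • Conv 1: (28 - 5 + 2*2)/1 + 1 = 28 → output volume 28 x 28 x 16
  • Pool 1: (28 - 2)/2 + 1 = 14 → output volume 14 x 14 x 16
  • Conv 2: (14 - 5 + 2*2)/1 + 1 = 14 → output volume 14 x 14 x 32
  • Pool 2: (14 - 2)/2 + 1 = 7 → output volume 7 x 7 x 32
  • Fully connected layer: 32 * 7 * 7 = 1,568 inputs → 10 class outputs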

Calculations for Conv 1 Layer (Image by Author)
Calculations for Pool1 Layer (Image by Author)
Calculations for Conv 2 Layer (Image by Author)
Calculations for Pool2 Layer (Image by Author)
Size of Fully Connected Layer (Image by Author)

Code snippet for defining the convnet:

import torch
import torch.nn as nn

class convnet1(nn.Module):
    def __init__(self):
        super(convnet1, self).__init__()

        # Constraints for layer 1
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride = 1, padding=2)
        self.batch1 = nn.BatchNorm2d(16)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2) #default stride is equivalent to the kernel_size

        # Constraints for layer 2
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride = 1, padding=2)
        self.batch2 = nn.BatchNorm2d(32)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2)

        # Defining the Linear layer
        self.fc = nn.Linear(32*7*7, 10)

    # defining the network flow
    def forward(self, x):
        # Conv 1
        out = self.conv1(x)
        out = self.batch1(out)
        out = self.relu1(out)

        # Max Pool 1
        out = self.pool1(out)

        # Conv 2
        out = self.conv2(out)
        out = self.batch2(out)
        out = self.relu2(out)

        # Max Pool 2
        out = self.pool2(out)

        out = out.view(out.size(0), -1)
        # Linear Layer
        out = self.fc(out)

        return out

We have also used batch normalization in our network, which protects against improper initialization of the weight matrices by explicitly forcing the activations throughout the network to take on an approximately unit Gaussian distribution. The code for the above-defined network is available here. We trained using cross-entropy as our loss function and the Adam optimizer with a learning rate of 0.001. After training the model, we achieved 90% accuracy on the test dataset.
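A minimal training loop consistent with that setup might look as follows (the batch size and number of epochs are illustrative choices, not necessarily those used for the reported result):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                   transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=128, shuffle=True)

model = convnet1()                                   # the network defined above
criterion = nn.CrossEntropyLoss()                    # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()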

Applications

Below are some applications of Convolutional Neural Networks used today:

  1. Object detection: With CNNs, we now have sophisticated models like R-CNN, Fast R-CNN, and Faster R-CNN that form the predominant pipeline for many object detection systems deployed in autonomous vehicles, facial detection, and more.
  2. Semantic segmentation: In 2015, a group of researchers from Hong Kong developed a CNN-based Deep Parsing Network to incorporate rich information into an image segmentation model. Researchers from UC Berkeley also built fully convolutional networks that improved upon state-of-the-art semantic segmentation.
  3. Image captioning: CNNs are used with recurrent neural networks to write captions for images and videos. This can be used for many applications, such as activity recognition or describing videos and images for the visually impaired. It has been heavily deployed by YouTube to make sense of the huge number of videos uploaded to the platform on a regular basis.

References

  1. Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville published by MIT Press, 2016
  2. Stanford University's Course – CS231n: Convolutional Neural Networks for Visual Recognition by Prof. Fei-Fei Li, Justin Johnson, Serena Yeung
  3. https://datascience.stackexchange.com/questions/14349/difference-of-activation-functions-in-neural-networks-in-general
  4. https://www.codementor.io/james_aka_yale/convolutional-neural-networks-the-biologically-inspired-model-iq6s48zms
  5. https://searchenterpriseai.techtarget.com/definition/convolutional-neural-network
