Each layer of a convolutional neural network serves a different purpose

What is the role of pooling layer in Convolutional Neural Network?

The role of the pooling layer in a convolutional neural network is as follows:

The pooling layer is an important part of a CNN. Its role is to downsample the output of the convolutional layer while keeping its most salient features. Specifically, the pooling layer shrinks the feature map by taking the maximum or the average over local regions of the convolutional output. This reduces the amount of computation and the number of parameters in the layers that follow, and it can improve the generalization ability of the model.

1. Types of Pooling Layers

There are various types of pooling layers, including max pooling, average pooling, and L2 pooling. Max pooling is the most commonly used: it takes the maximum value over each local region of the convolutional output, which keeps the most prominent features and discards redundant information in the feature map.

2. Parameters of Pooling Layer

The parameters of a pooling layer are the pooling kernel size, the stride, and the padding method. The kernel size is the size of the local region that the pooling layer operates on, the stride is the distance the window moves between successive operations, and the padding method determines how the edges of the input are padded during pooling.

3. Advantages of Pooling Layer

The pooling layer has several advantages. First, it reduces redundant information in the feature map and improves the generalization ability of the model. Second, it reduces the computational cost of the model and speeds up training. Finally, it improves robustness by making the model tolerant to small changes in the input data.

4. Disadvantages of Pooling Layer

The pooling layer also has some disadvantages. First, it loses some detailed information that may affect the performance of the model. Secondly, it may introduce some errors that may cause the accuracy of the model to decrease. Finally, it may lead to a smaller size of the feature map, making the model require a higher resolution of the input data.

5. Conclusion

The pooling layer is an important component in CNNs, which serves to perform dimensionality reduction and feature extraction on the output of the convolutional layer. The pooling layer has multiple types and parameters that can be selected according to the specific application scenario. Pooling layers have multiple advantages and disadvantages that need to be weighed and selected in practical applications.

Convolutional neural network’s convolutional layer, activation layer, pooling layer, fully connected layer

The input is an image (input layer); CONV denotes a convolutional layer, RELU an activation layer, POOL a pooling layer, and FC a fully connected layer.

A fully connected neural network needs a very large amount of computation for forward and backward propagation, and it has a very large number of parameters. If the number of training samples is small relative to that capacity, the network can simply memorize every sample it is given, which leads to overfitting.

So we want to reduce the number of connection weights between neurons without weakening the learning ability of the network too much. One way to do this is local connections plus weight sharing: the number of weights drops sharply while the learning ability is not materially reduced, and there are further benefits as well. Consider the following images:

Different representations of an image

These images all depict the same thing, but at different sizes and positions: some are large, some small, some shifted to the left, some to the right. The network we build should give the same result for all of them. To achieve this, we let different positions in the image use the same weights (weight sharing), so a feature learned at one position is also detected at every other position. The network therefore does not need a separate training example for every position of the object, which is the benefit of weight sharing.

And a convolutional neural network is a locally connected + weight sharing neural network.

Now that we have a preliminary understanding, let us look at the convolutional neural network in detail. It is still a layered structure, but the function and form of the layers have changed. Convolutional neural networks are most often used to process image data, for example to recognize a car:

Before an image is fed into the neural network, it is usually preprocessed first. Common preprocessing methods are:

Mean subtraction and normalization

Decorrelation and whitening

Images have a property called local correlation: a pixel is most strongly related to the pixels in its immediate vicinity and has little to do with pixels far away. Because of this, a neuron does not need to look at the whole image (that is, be fully connected to the previous layer). Each neuron only needs a local connection to the previous layer, which amounts to scanning a small area. Many neurons sharing the same weights together scan the whole image and form one feature map. n feature maps extract n kinds of features from the image, and each feature map is made up of many neurons.

In a convolutional neural network, we first select a local region (the filter) and use it to scan the whole image. All the nodes circled by the local region are connected to one node in the next layer. Let us take a grayscale image (only one channel) as an example:

Local region

The image is a matrix; in the figure its nodes have been unrolled into a vector so that the connections between the input layer and the convolutional layer are easier to see. The layers are not fully connected. We call the red box in the picture above the filter. Here it is 2×2, but this size is not fixed and we can choose it ourselves.

The current filter is a 2×2 window that slides over the image matrix from the top-left corner to the bottom-right corner. At each position it circles four nodes and connects them to one neuron in the next layer, which gives four weights. The matrix formed by these four weights (w1, w2, w3, w4) is called the convolution kernel.

The convolution kernel is learned by the algorithm itself. It is applied to the previous layer: for example, the value of node 0 in the second layer is a linear combination of the local region, $w_1 x_0 + w_2 x_1 + w_3 x_4 + w_4 x_5$, where $x_i$ is the value of input node $i$. That is, the values of the circled nodes are multiplied by the corresponding weights and then added together.

Convolution kernel computation

Convolution operation

We said earlier that images are not represented by vectors in order to preserve information about the planar structure of the image. In the same way, the output of the convolution will lose the planar structure if we use the vector arrangement in the above figure. So we still arrange them in a matrix, and we get the connections shown below, with four yellow nodes connected to each blue node.

Convolutional layer connection

The image is a matrix and the next layer of the network is also a matrix. We slide a convolution kernel over the image matrix from the top-left corner to the bottom-right corner. At each position the circled neurons are connected to one neuron in the next layer, and the weights of these connections form the convolution kernel. Although each position circles different neurons and connects to a different neuron in the next layer, the weight matrix is the same every time. This is weight sharing.

At each position the convolution kernel is combined with the local patch of the image it covers to produce one value, for example $w_1 x_0 + w_2 x_1 + w_3 x_4 + w_4 x_5$ at the first position, where $x_i$ is the value of input node $i$. Sliding the filter from the top left to the bottom right therefore produces a matrix of values (which is why the next layer is also a matrix). The process is shown below:

Convolution calculations process

The left side of the figure above is the image matrix, and the filter is 3×3. At the first position, the convolution kernel and the image patch give (1×1 + 1×0 + 1×1 + 0×0 + 1×1 + 1×0 + 0×1 + 0×0 + 1×1) = 4, which is the first value of the matrix on the right. The filter visits 9 positions in total and produces 9 values, so the next layer has 9 neurons. The values of these 9 neurons form a matrix called the feature map, which represents one kind of feature of the image. Exactly which feature it is may not be known: it might be color, contour, and so on.
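The sliding-window computation described above can be sketched in plain Python. This is a minimal illustration rather than a definitive implementation: the function name `cross_correlate2d` and the concrete 5×5 image and 3×3 kernel are assumptions, chosen so that the first window reproduces the sum of products that equals 4.

```python
def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for u in range(kh):
                for v in range(kw):
                    total += image[i + u][j + v] * kernel[u][v]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 5x5 binary image and 3x3 kernel (not taken from the figure).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = cross_correlate2d(image, kernel)
for row in feature_map:
    print(row)
```

The 5×5 input and 3×3 filter give a 3×3 feature map, i.e. 9 values, one per filter position.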

Single-channel summary: the above is the convolution process for a single-channel image. The image is a matrix, and a convolution kernel of a chosen size slides over it from the top-left corner to the bottom-right corner. At each position the circled nodes are connected to one node in the next layer, forming a local connection whose weights are the convolution kernel. Because the weights are shared, the same kernel is used at every position. At each position the kernel is combined with the patch it covers, and the result is the value of one node in the next layer; the values from all positions together form a feature map, which represents one kind of feature. One pass from top left to bottom right with one kernel yields one feature map, and another pass with a different kernel yields another feature map.

How does a convolution operation work for a three-channel image?

So far we have seen how a single-channel grayscale image is processed. In practice most images are RGB images with three channels, so how is such an image convolved?

Color image

The filter window still slides only along the width and height, not along the depth, and one convolution kernel is shared across all positions. Since depth = 3, however, the kernel itself now has three channels, one for each of the blue, green, and red channels of the input image. At each position, one channel of the kernel acts on the blue matrix to produce a value, another on the green matrix, and the last on the red matrix. These three values are added up to give the value of one node in the next layer, and the result over all positions is again a matrix, which is a feature map.

The three-channel process

To have more than one feature map, we can then use a new convolution kernel to do a top-left to bottom-right slide, which will form a new feature map.
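A small sketch of the multi-channel case may help. It assumes the input is a list of per-channel matrices and that each kernel carries one 2D slice per input channel; the names `conv_single_kernel` and `conv_layer` and all numeric values are hypothetical.

```python
def cross_correlate2d(image, kernel):
    """Stride-1, no-padding sliding window sum of products."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


def conv_single_kernel(channels, kernel_channels, bias=0):
    """One multi-channel kernel -> one feature map.

    Each kernel slice is applied to its own input channel and the
    per-channel results are summed element-wise.
    """
    per_channel = [cross_correlate2d(c, k)
                   for c, k in zip(channels, kernel_channels)]
    h, w = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(m[i][j] for m in per_channel) + bias for j in range(w)]
            for i in range(h)]


def conv_layer(channels, kernels):
    """Several kernels -> several feature maps (one per kernel)."""
    return [conv_single_kernel(channels, k) for k in kernels]


# Hypothetical 3-channel 3x3 input.
channels = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
]
# Two hypothetical kernels, each with three 2x2 slices.
kernel_a = [[[1, 0], [0, 1]], [[1, 1], [1, 1]], [[5, 5], [5, 5]]]
kernel_b = [[[0, 0], [0, 0]], [[0, 0], [0, 1]], [[0, 0], [0, 0]]]

feature_maps = conv_layer(channels, [kernel_a, kernel_b])
print(feature_maps)
```

Two kernels give two feature maps, so the output depth equals the number of kernels regardless of the input depth.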

The convolution process for a three-channel image

That is, each additional convolution kernel produces one more feature map. In general, a convolution kernel must have as many channels as its input, and a layer produces as many feature maps as it has convolution kernels. The output of a convolutional layer can then be fed as a new input to another convolutional layer: if it contains several feature maps, its depth is that number, and each kernel of the next layer must have the same number of channels. To make this logic clear, we first need a few basic concepts:

Convolutional computation of the formula

If a 4×4 image is zero-padded with a border of one pixel and then convolved with a 3×3 filter (stride 1), the resulting feature map is still 4×4.


You can also use a 5×5 filter with zero padding of 2 to keep the original size of the image. The 3×3 filter takes into account the relationship between a pixel and all pixels up to a distance of 1 from it, while the 5×5 filter covers all pixels up to a distance of 2.

In general, the size of the feature map is

(input_size + 2*padding_size - filter_size)/stride + 1
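The formula can be checked against the two padding examples above with a few lines of Python. The helper name `feature_map_size` is an assumption, and so is the use of floor division when the stride does not divide evenly (a common convention, not stated in the text).

```python
def feature_map_size(input_size, filter_size, padding=0, stride=1):
    """(input_size + 2*padding - filter_size) / stride + 1, rounded down."""
    span = input_size + 2 * padding - filter_size
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    return span // stride + 1


same_3x3 = feature_map_size(4, 3, padding=1)   # 4x4 input, 3x3 filter, padding 1
same_5x5 = feature_map_size(4, 5, padding=2)   # 4x4 input, 5x5 filter, padding 2
valid_3x3 = feature_map_size(5, 3)             # 5x5 input, 3x3 filter, no padding
print(same_3x3, same_5x5, valid_3x3)
```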

To summarize the role of the convolutional layer: it extracts features, and the most important part of the layer is the learned convolution kernel. Different kernels detect specific shapes, colors, contrasts, and so on. The feature map keeps the spatial structure of what was detected, so the feature map of each kernel represents one kind of feature, although we may not know exactly which. When feature maps are themselves convolved as inputs, the next layer can detect “larger” shape concepts, which means that as the network gets deeper, the extracted features cover larger regions and become more abstract.

The activation layer (also called the excitation layer) can be understood as a nonlinear mapping applied to the results of the convolutional layer.

Excitation layer

The f in the figure above is the activation function. Some commonly used activation functions are:

Commonly used excitation function

First, the sigmoid activation function: its derivative lies between 0 and 1/4 (the maximum, reached at x = 0), so gradients shrink as they are propagated back through many sigmoid layers.

Excitation Function Sigmoid

Tanh activation function: similar in shape to the sigmoid, but its output is symmetric about the origin (zero-centered), so it is not biased toward one sign.

Tanh activation function

ReLU activation function (Rectified Linear Unit): converges fast and its gradient is cheap to compute, but it is fragile, because the gradient is 0 for all negative inputs and a neuron stuck there stops learning.

ReLU activation function

LeakyReLU activation function: does not saturate and does not die, and it is still cheap to compute, although it costs slightly more than ReLU.

LeakyReLU activation function

Some tips for choosing an activation function: generally avoid sigmoid. Try ReLU first because it is fast, but use it with care. If ReLU fails, use LeakyReLU. In some cases tanh also gives good results.
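The four activation functions discussed above can be written out directly. This is only a scalar sketch using the standard library; the LeakyReLU slope of 0.01 is an assumed default, not a value given in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum is 1/4, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh(x):
    return math.tanh(x)


def relu(x):
    """Zero output (and zero gradient) for negative inputs."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Keeps a small slope for negative inputs (slope is an assumed value)."""
    return x if x > 0 else slope * x


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), sigmoid_grad(x), tanh(x), relu(x), leaky_relu(x))
```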

This is the activation layer of the convolutional neural network: a nonlinear mapping of the linear result computed by the convolutional layer. The diagram below shows a nonlinear operation applied to a feature map. The output can be seen as a “rectified” feature map:

Nonlinear operations

Pooling layer: reduces the dimensionality of each feature map while keeping most of the important information. Pooling layers are placed between successive convolutional layers to compress the amount of data and the number of downstream parameters, which reduces overfitting. The pooling layer itself has no learnable parameters; it merely downsamples (compresses) the results passed to it by the previous layer. There are two common ways of downsampling:

Max pooling: define a spatial neighborhood (e.g., a 2×2 window) and take the largest element of the rectified feature map within the window. Max pooling has been found to work somewhat better in practice.

Average pooling: define a spatial neighborhood (e.g., a 2×2 window) and take the average of the rectified feature map within the window.


One point needs attention: pooling is performed separately at each depth slice. If depth = 5, pooling is performed 5 times and produces 5 pooled matrices, with no learnable parameters involved. Because the pooling operation is applied separately to each feature map, five input maps give five output maps.
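As a rough sketch of the two downsampling schemes and of pooling being applied per feature map, consider the following. The names `pool2d` and `pool_volume`, the 2×2 window with stride 2, and the sample values are all assumptions for illustration.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Pool one feature map with a size x size window."""
    if mode == "max":
        reduce_fn = max
    else:
        reduce_fn = lambda vals: sum(vals) / len(vals)
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            window = [fmap[i + u][j + v]
                      for u in range(size) for v in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def pool_volume(volume, **kwargs):
    """Apply pooling to each feature map separately; depth is unchanged."""
    return [pool2d(fmap, **kwargs) for fmap in volume]


# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
pooled_volume = pool_volume([fmap] * 5, mode="max")  # depth 5 in, depth 5 out
print(max_pooled, avg_pooled, len(pooled_volume))
```

Note that the function has no weights at all: the only settings are the window size, the stride, and the choice of max or average.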

Pooling operation

Both max pooling and average pooling discard part of the information. Will this damage the recognition result?

The feature map produced by convolution contains redundant information that is not needed to recognize the object. Downsampling removes much of this redundancy, so in most cases it does little damage to the recognition result.

Let’s take a look at how the redundant information is generated after convolution.

A convolution kernel is designed to find one particular kind of information, such as a certain shape, but that shape does not appear everywhere in the image. At locations where the shape is absent, the convolution still produces a value, but that value carries little meaning. The pooling layer removes such values, which is why it does not noticeably harm the recognition result.

For example, in the figure below, suppose the convolution kernel detects the shape “horizontal fold”. In the 3×3 FeatureMap obtained after convolution, the only really useful node is the one with the number 3, and the rest of the values are irrelevant for this task. So Maxpooling with 3×3 has no effect on the detection of the “horizontal fold”. Imagine in this example if we don’t use Maxpooling and let the network learn on its own. The network will also learn the weights that approximate the effect of Maxpooling. Because it is an approximate effect, it adds the cost of more parameters, yet it is not as good as just doing Maxpooling.


In a fully connected layer, every neuron is connected by a weight to every neuron of the previous layer. The fully connected layers usually sit at the tail of the convolutional neural network. Once the preceding convolutional layers have captured enough features to recognize the image, the next step is classification. The end of the convolutional network usually flattens the final feature maps into one long vector and feeds it into the fully connected layer, which together with the output layer performs the classification. For example, in the figure below the task is a four-class image classification problem, so the output layer has four neurons.
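The flatten-then-classify step can be sketched as follows. Everything concrete here is an assumption made for illustration: the two 2×2 feature maps, the deterministic placeholder weights, and the use of softmax to turn the four outputs into class probabilities (the text only says the output layer has four neurons).

```python
import math


def flatten(feature_maps):
    """Unroll a list of 2D feature maps into one long vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(x, weights, biases):
    """Fully connected layer: every output is connected to every input."""
    return [sum(w * xi for w, xi in zip(w_row, x)) + b
            for w_row, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)                     # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical final feature maps (two maps of size 2x2 -> 8 features).
feature_maps = [[[6, 4], [8, 9]], [[1, 0], [2, 3]]]
features = flatten(feature_maps)

num_classes = 4
# Placeholder weights; in a real network these are learned.
weights = [[0.1 * ((c + i) % 3) for i in range(len(features))]
           for c in range(num_classes)]
biases = [0.0] * num_classes

probabilities = softmax(dense(features, weights, biases))
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```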

Quadruple classification problem

We have explained a convolutional neural network in terms of its input, convolutional, activation, pooling, and fully connected layers. We can think of the convolutional, activation, and pooling layers as doing the feature extraction and the fully connected layers as doing the classification, which is the core idea of a convolutional neural network.

Convolutional Neural Networks

Notes about convolutional networks in the flower book are recorded at https://www.jianshu.com/p/5a3c90ea0807.

A convolutional neural network (CNN or ConvNet) is a deep feedforward neural network with properties such as local connectivity and weight sharing. It was inspired by the biological mechanism of the receptive field. The receptive field refers to a property of some neurons in the auditory, visual, and other nervous systems: a neuron only receives signals from the stimulus region it innervates.

Convolutional neural networks were first used mainly to process image information. When fully-connected feedforward networks are used to process images, there are two problems:

Current convolutional neural networks are generally feedforward networks built by interleaving convolutional, pooling, and fully connected layers, and they are trained with the back-propagation algorithm. They have three structural properties: local connectivity, weight sharing, and pooling. These properties give the network a degree of invariance to translation, scaling, and rotation.

Convolution is an important operation in mathematical analysis. In signal processing and image processing, one- or two-dimensional convolution is often used.

One-dimensional convolution is often used in signal processing to compute the delayed accumulation of a signal. Suppose that a signal generator produces a signal $x_t$ at each moment $t$, and that the information decays so that after $k-1$ time steps it is $w_k$ times the original. The signal $y_t$ received at moment $t$ is then the superposition of the information generated at the current moment and the delayed information from previous moments:

$$y_t = \sum_{k} w_k \, x_{t-k+1}$$

We refer to $w_1, w_2, \ldots$ as a filter or convolution kernel. Assuming that the filter has length $m$, its convolution with a signal sequence $x_1, x_2, \ldots$ is:

$$y_t = \sum_{k=1}^{m} w_k \, x_{t-k+1}$$

The convolution of a signal sequence $x$ and a filter $w$ is written as:

$$y = w * x$$

In general the length of the filter is much smaller than the length of the signal sequence, the following figure gives an example of a one-dimensional convolution, with a filter:
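A one-dimensional convolution can be sketched in a few lines, using the usual definition $y_t = \sum_{k=1}^{m} w_k x_{t-k+1}$ with zero-based indices. The function name `conv1d`, the signal values, and the decay weights 1, 1/2, 1/4 are assumptions chosen for illustration; outputs are only produced where the filter fully overlaps the signal.

```python
def conv1d(signal, filt):
    """y[t] = sum_k filt[k] * signal[t - k], for every t with full overlap."""
    m = len(filt)
    return [sum(filt[k] * signal[t - k] for k in range(m))
            for t in range(m - 1, len(signal))]


# Hypothetical signal and decay weights: the current sample counts fully,
# the previous one by half, the one before that by a quarter.
signal = [1, 2, 3, 4, 5]
decay = [1.0, 0.5, 0.25]

received = conv1d(signal, decay)
print(received)
```

A signal of length 5 and a filter of length 3 give 5 - 3 + 1 = 3 outputs, which matches the feature map size formula with no padding and a stride of 1.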

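Two-dimensional convolution is often used in image processing. Because an image is a two-dimensional structure, the one-dimensional convolution needs to be extended. Given an image $X \in \mathbb{R}^{M \times N}$ and a filter $W \in \mathbb{R}^{U \times V}$ (usually with $U \ll M$ and $V \ll N$), the convolution is:

$$y_{ij} = \sum_{u=1}^{U} \sum_{v=1}^{V} w_{uv} \, x_{i-u+1,\, j-v+1}$$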

The following figure gives an example of a two-dimensional convolution:

Note that convolution here does not mean framing a kernel-sized box in the image, multiplying the pixel values by the corresponding kernel elements, and summing. Instead, the kernel is first rotated by 180 degrees and then that operation is applied.

In image processing, convolution is often used as an effective method for feature extraction. An image obtained after a convolution operation is called a FeatureMap.

The top filter is a Gaussian filter, which can be used to smooth and denoise the image; the middle and bottom filters can be used to extract edge features.

In machine learning and image processing, the main function of convolution is to slide a convolution kernel (i.e., a filter) over an image (or some other feature) and obtain a new set of features. Computing a convolution requires flipping the kernel (the 180-degree rotation mentioned above). In practice, convolution is generally replaced by the cross-correlation operation, which avoids this unnecessary step and its overhead.

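Cross-correlation is a function that measures the correlation of two sequences, usually implemented as a dot product over a sliding window. Given an image $X \in \mathbb{R}^{M \times N}$ and a convolution kernel $W \in \mathbb{R}^{U \times V}$, their cross-correlation is:

$$y_{ij} = \sum_{u=1}^{U} \sum_{v=1}^{V} w_{uv} \, x_{i+u-1,\, j+v-1}$$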

The only difference between cross-correlation and convolution is whether the kernel is flipped, so cross-correlation can also be called non-flipped convolution. When the kernel is a learnable parameter, convolution and cross-correlation are equivalent in terms of what the layer can learn. Therefore, for convenience of implementation (and description), we use cross-correlation instead of convolution. In fact, many of the convolution operations in deep learning tools are actually cross-correlation operations.
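The relationship between the two operations can be illustrated with a short sketch. The helper names `rot180`, `cross_correlate2d`, and `convolve2d` and the sample matrices are assumptions; only the region where the kernel fully overlaps the image is computed.

```python
def rot180(kernel):
    """Rotate a 2D kernel by 180 degrees (flip rows and columns)."""
    return [row[::-1] for row in kernel[::-1]]


def cross_correlate2d(image, kernel):
    """Sliding dot product without flipping the kernel."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


def convolve2d(image, kernel):
    """True convolution: flip the kernel, then cross-correlate."""
    return cross_correlate2d(image, rot180(kernel))


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
asymmetric = [[1, 0], [0, -1]]
symmetric = [[1, 2], [2, 1]]

print(cross_correlate2d(image, asymmetric))  # differs from the convolution
print(convolve2d(image, asymmetric))
print(cross_correlate2d(image, symmetric) == convolve2d(image, symmetric))
```

For a kernel that is unchanged by a 180-degree rotation the two results coincide; for a learned kernel the network can simply learn the flipped weights, which is why the distinction does not matter in practice.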

On top of the standard definition of convolution, a stride and zero padding can be introduced to make convolution more varied and feature extraction more flexible.

The stride of a filter is the interval by which the filter moves at each step as it slides.

Zero padding pads both ends of the input vector with zeros.

Suppose that a convolutional layer has $n$ input neurons, a kernel of size $m$, a stride of $s$, and $p$ zeros padded at each end of the input. The number of neurons in the layer is then $(n - m + 2p)/s + 1$.

There are three types of convolutions commonly used in general:

Because the training of convolutional networks is also based on the back-propagation algorithm, let’s focus on the derivative properties of convolutions:


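Suppose $Y = W \otimes X$, where $X \in \mathbb{R}^{M \times N}$, $W \in \mathbb{R}^{U \times V}$, $Y \in \mathbb{R}^{(M-U+1) \times (N-V+1)}$, and $\otimes$ denotes cross-correlation, so that $y_{ij} = \sum_{u=1}^{U} \sum_{v=1}^{V} w_{uv} \, x_{i+u-1,\, j+v-1}$. Let $f(Y) \in \mathbb{R}$ be a scalar function.

Then by the chain rule:

$$\frac{\partial f(Y)}{\partial w_{uv}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} \frac{\partial y_{ij}}{\partial w_{uv}} \frac{\partial f(Y)}{\partial y_{ij}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} x_{i+u-1,\, j+v-1} \frac{\partial f(Y)}{\partial y_{ij}}$$

It can be seen that the partial derivative of $f(Y)$ with respect to $W$ is the convolution of $X$ and $\partial f(Y) / \partial Y$:

$$\frac{\partial f(Y)}{\partial W} = \frac{\partial f(Y)}{\partial Y} \otimes X$$

It is similarly obtained that:

$$\frac{\partial f(Y)}{\partial x_{st}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} w_{s-i+1,\, t-j+1} \frac{\partial f(Y)}{\partial y_{ij}}$$

where $w_{s-i+1,\, t-j+1} = 0$ when $(s-i+1) < 1$ or $(s-i+1) > U$, or when $(t-j+1) < 1$ or $(t-j+1) > V$, which is equivalent to zero-padding $W$. The partial derivative of $f(Y)$ with respect to $X$ is therefore the wide convolution of $W$ and $\partial f(Y) / \partial Y$.

Expressed in terms of the cross-correlation form of “convolution”, this is (note that wide convolution is commutative):

$$\frac{\partial f(Y)}{\partial X} = \mathrm{rot180}\!\left(\frac{\partial f(Y)}{\partial Y}\right) \tilde{\otimes}\, W = \mathrm{rot180}(W) \,\tilde{\otimes}\, \frac{\partial f(Y)}{\partial Y}$$

where $\tilde{\otimes}$ denotes wide cross-correlation and $\mathrm{rot180}(\cdot)$ rotates a matrix by 180 degrees.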

In a fully connected feedforward neural network, if layer $l$ has $n^{(l)}$ neurons and layer $l-1$ has $n^{(l-1)}$ neurons, there are $n^{(l)} \times n^{(l-1)}$ connections, i.e., the weight matrix has $n^{(l)} \times n^{(l-1)}$ parameters. When both numbers are large, the weight matrix has a great many parameters, and training can be very inefficient.

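If convolution is used instead of full connectivity, the net input $z^{(l)}$ of layer $l$ is the convolution of the activations $a^{(l-1)}$ of layer $l-1$ with a filter $w^{(l)}$, i.e.

$$z^{(l)} = w^{(l)} \otimes a^{(l-1)} + b^{(l)}$$

where $w^{(l)}$ is a learnable weight vector (the filter) and $b^{(l)}$ is a learnable bias.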

Based on the definition of convolution, there are two very important properties of convolutional layers:

Because of local connectivity and weight sharing, the parameters of the convolutional layer are only an m-dimensional weight and a 1-dimensional bias, for a total of m + 1 parameters. The number of parameters is independent of the number of neurons. In addition, the number of neurons in layer $l$ cannot be chosen arbitrarily; it is determined by the size of the previous layer and of the filter, $n^{(l)} = n^{(l-1)} - m + 1$.

The role of the convolutional layer is to extract features from a local region, and different convolutional kernels correspond to different feature extractors.

A feature map is the set of features extracted from an image (or from other feature maps) by convolution; each feature map can be regarded as one class of extracted image features. To improve the representation capability of the network, several different feature maps can be used at each layer.

In the input layer, the feature maps are the image itself. A grayscale image has one feature map, so the depth is 1; a color image has one feature map for each of the three RGB channels, so the depth is 3.

Without loss of generality, suppose a convolutional layer is structured as follows:

In order to compute the output feature mapping, a convolutional kernel is used to convolve the input feature mappings separately, and then the results of the convolution are summed up and a scalar bias is added to get the net input of the convolutional layer and then the output feature mapping is obtained after a nonlinear activation function.

In a convolutional layer with $D$ input feature maps and $P$ output feature maps, each output feature map requires $D$ filters and one bias. Assuming each filter has size $U \times V$, a total of $P \times D \times (U \times V) + P$ parameters is required.
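To make the parameter saving concrete, the count for a convolutional layer can be compared with a fully connected layer between volumes of the same size. This sketch assumes the usual count of one filter slice per (input map, output map) pair plus one bias per output map; the function names and the 32×32×3 input are hypothetical.

```python
def conv_layer_params(in_maps, out_maps, kernel_h, kernel_w):
    """out_maps * in_maps filter slices of kernel_h x kernel_w, plus biases."""
    return out_maps * in_maps * kernel_h * kernel_w + out_maps


def dense_layer_params(n_inputs, n_outputs):
    """One weight per (input, output) pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical layer: 3 input maps -> 16 output maps with 3x3 filters.
conv_params = conv_layer_params(3, 16, 3, 3)

# The same mapping done with full connectivity on a 32x32 image:
# 32*32*3 inputs connected to 32*32*16 outputs.
dense_params = dense_layer_params(32 * 32 * 3, 32 * 32 * 16)

print(conv_params, dense_params)
```

The convolutional count does not depend on the image size at all, while the fully connected count grows with the product of the input and output sizes.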

The pooling layer, also known as the subsampling layer, performs feature selection: it reduces the number of features and thus the number of parameters in subsequent layers.

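There are two commonly used pooling functions. For a region $R$ of a feature map:

Max pooling takes the maximum activation in the region: $y = \max_{i \in R} x_i$

Mean pooling takes the average activation in the region: $y = \frac{1}{|R|} \sum_{i \in R} x_i$

where $x_i$ is the activation value of each neuron in the region.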

It can be seen that the pooling layer not only effectively reduces the number of neurons, but also makes the network invariant to some small local changes in shape and gives it a larger receptive field.

A typical pooling layer divides each feature map into non-overlapping regions of a fixed size (for example 2×2) and downsamples them with max pooling. The pooling layer can also be viewed as a special convolutional layer with a given kernel size and stride, whose kernel is the max function or the mean function. Too large a pooling region drastically reduces the number of neurons and can lose too much information.

A typical convolutional network is built by interleaving convolutional, pooling, and fully connected layers.

The commonly used convolutional network structure is shown in the figure. A convolutional block consists of several consecutive convolutional layers and a pooling layer. A convolutional network stacks a number of consecutive convolutional blocks, followed by fully connected layers.

Currently, networks tend to use smaller convolutional kernels and deeper structures (e.g., more than 50 layers). In addition, because convolution has become increasingly flexible (e.g., different strides), pooling layers have become less important, and their proportion in popular convolutional networks is gradually decreasing, tending towards all-convolutional networks.

In a fully connected feedforward network, the gradient is back-propagated mainly through the error term of each layer, from which the gradient of the parameters of each layer is computed. A convolutional neural network has two main kinds of layers with different functions: convolutional layers and pooling layers. The parameters are the convolution kernels and the biases, so only the gradients of the parameters in the convolutional layers need to be computed.

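Without loss of generality, let layer $l$ be a convolutional layer whose input feature maps are $X^{(l-1)} \in \mathbb{R}^{M \times N \times D}$. Convolution gives the net input $Z^{(l)} \in \mathbb{R}^{M' \times N' \times P}$ of the feature maps of layer $l$, and the net input of the $p$-th feature map of layer $l$ is

$$Z^{(l,p)} = \sum_{d=1}^{D} W^{(l,p,d)} \otimes X^{(l-1,d)} + b^{(l,p)}$$

where $W^{(l,p,d)}$ and $b^{(l,p)}$ are the convolution kernel and the bias. Writing $\delta^{(l,p)} = \partial \mathcal{L} / \partial Z^{(l,p)}$ for the error term of the $p$-th feature map, the partial derivative of the loss function $\mathcal{L}$ with respect to the kernel is

$$\frac{\partial \mathcal{L}}{\partial W^{(l,p,d)}} = \delta^{(l,p)} \otimes X^{(l-1,d)}$$

By the same reasoning, the partial derivative of the loss function with respect to the $p$-th bias $b^{(l,p)}$ of layer $l$ is:

$$\frac{\partial \mathcal{L}}{\partial b^{(l,p)}} = \sum_{i,j} \left[\delta^{(l,p)}\right]_{i,j}$$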

In a convolutional network, the gradient of each layer’s parameters depends on the error term of the layer in which it is placed.

The error terms are calculated differently in the convolutional and pooling layers, so we calculate their error terms separately.

When layer $l+1$ is a pooling layer, the error term $\delta^{(l,p)}$ of the $p$-th feature map in layer $l$ is:

$$\delta^{(l,p)} = f_l'\!\left(Z^{(l,p)}\right) \odot \mathrm{up}\!\left(\delta^{(l+1,p)}\right)$$

where $f_l'(\cdot)$ is the derivative of the activation function used in layer $l$, and $\mathrm{up}$ is the upsampling function, which is the reverse of the downsampling operation used in the pooling layer. If the downsampling is max pooling, each value of the error term $\delta^{(l+1,p)}$ is passed directly to the neuron that held the maximum in the corresponding region of the previous layer, and the error terms of the other neurons in the region are set to 0. If the downsampling is mean pooling, each value of the error term is distributed equally among all neurons in the corresponding region of the previous layer.
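The two upsampling rules can be sketched as follows for non-overlapping windows. The function names, the 2×2 window, and the sample values are assumptions; when several entries tie for the maximum, this sketch routes the error to the first one.

```python
def maxpool_backward(fmap, grad_out, size=2):
    """Route each pooled error value to the position of the window maximum."""
    grad_in = [[0.0] * len(fmap[0]) for _ in fmap]
    for i, grad_row in enumerate(grad_out):
        for j, g in enumerate(grad_row):
            cells = [(i * size + u, j * size + v)
                     for u in range(size) for v in range(size)]
            r, c = max(cells, key=lambda rc: fmap[rc[0]][rc[1]])
            grad_in[r][c] += g          # all other cells in the window stay 0
    return grad_in


def meanpool_backward(grad_out, size=2):
    """Spread each pooled error value equally over its window."""
    height = len(grad_out) * size
    width = len(grad_out[0]) * size
    return [[grad_out[r // size][c // size] / (size * size)
             for c in range(width)]
            for r in range(height)]


# Hypothetical forward input of the pooling layer and incoming error terms.
fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
grad_out = [[1, 2], [3, 4]]

grad_max = maxpool_backward(fmap, grad_out)
grad_mean = meanpool_backward(grad_out)
for row in grad_max:
    print(row)
for row in grad_mean:
    print(row)
```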

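When layer $l+1$ is a convolutional layer, the error term $\delta^{(l,d)}$ of the $d$-th feature map of layer $l$ is:

$$\delta^{(l,d)} = f_l'\!\left(Z^{(l,d)}\right) \odot \sum_{p=1}^{P} \mathrm{rot180}\!\left(W^{(l+1,p,d)}\right) \tilde{\otimes}\, \delta^{(l+1,p)}$$

where $\tilde{\otimes}$ is wide convolution.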

LeNet-5 was proposed early, yet it is a very successful neural network model. A handwritten digit recognition system based on LeNet-5 was used by many banks in the United States in the 1990s to recognize handwritten digits on checks. The network structure of LeNet-5 is shown in the figure:

Excluding the input layers, there are seven layers in LeNet-5, and each layer is structured as follows:

AlexNet was the first modern deep convolutional network model. It was the first to use many of the techniques of modern deep convolutional networks, such as ReLU as the nonlinear activation function, Dropout to prevent overfitting, and data augmentation to improve accuracy. AlexNet won the 2012 ImageNet image classification competition.

The structure of AlexNet is shown in the figure, including five convolutional layers, three fully connected layers and one softmax layer. Because the size of the network exceeded the memory limitations of a single GPU at the time, AlexNet split the network in half and placed it on two separate GPUs, which communicated with each other only on certain layers (such as layer 3).

The exact structure of AlexNet is as follows:

In a convolutional network, choosing the kernel size of a convolutional layer is a critical issue. In an Inception network, a convolutional layer contains several convolution operations of different sizes; such a layer is called an Inception module. An Inception network is a stack of multiple Inception modules and a small number of pooling layers.

The v1 version of the Inception module uses four parallel groups of feature extraction: 1×1, 3×3, and 5×5 convolutions and 3×3 max pooling. To improve computational efficiency and reduce the number of parameters, the Inception module applies a 1×1 convolution to reduce the depth of the feature maps before the 3×3 and 5×5 convolutions and after the 3×3 max pooling. If the input feature maps contain redundant information, the 1×1 convolution is equivalent to performing a feature extraction first.
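The parameter saving from the 1×1 convolution can be checked with a little arithmetic. The channel counts below (192 input channels, 32 output channels, and a reduction to 16 channels) are illustrative assumptions rather than values given in the text, and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


# Illustrative channel counts, assumed for this example only.
c_in, c_out, c_mid = 192, 32, 16

# 5x5 convolution applied directly to the input feature maps.
direct = conv_params(5, c_in, c_out)

# 1x1 convolution that reduces the depth first, followed by the 5x5 convolution.
reduced = conv_params(1, c_in, c_mid) + conv_params(5, c_mid, c_out)

print(direct, reduced)  # 153600 15872
```

Under these assumptions the 5×5 branch needs roughly a tenth of the weights once the depth is reduced first.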

An article on Convolutional Neural Networks-CNN (Basic Principles + Unique Value + Practical Applications)

Before the advent of CNNs, images were a difficult problem for Artificial Intelligence for 2 reasons:

The amount of data that needs to be processed for images is too large, resulting in high costs and low efficiency

Images are digitized in a way that makes it difficult to retain their original features, leading to poor accuracy in image processing

The following explains these 2 problems in detail:

Images are made up of pixels, and each pixel is made up of color.

Nowadays, a typical image is 1000×1000 pixels or more, and each pixel has 3 RGB parameters to represent color information.

If we process an image of 1000×1000 pixels, we need to process 3 million parameters!


Such a large amount of data is very resource-intensive to process, and this is not even a very large image!
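The arithmetic is easy to verify, and it gets worse once the image is fed into a fully connected layer. The hidden layer size of 1000 units below is a hypothetical value chosen only to show the scale.

```python
width, height, channels = 1000, 1000, 3

# One value per pixel per color channel.
inputs = width * height * channels

# Hypothetical fully connected layer with 1000 units: one weight per
# (input value, unit) pair.
hidden_units = 1000
dense_weights = inputs * hidden_units

print(inputs)         # 3000000
print(dense_weights)  # 3000000000
```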

The first thing a Convolutional Neural Network (CNN) does is ‘simplify a complex problem’: it reduces a large number of parameters to a small number before doing further processing.

More importantly, in most scenarios this reduction does not affect the result. For example, shrinking a 1000-pixel image to 200 pixels does not affect the naked eye’s ability to tell whether the image shows a cat or a dog, and the same holds for a machine.

The traditional way of digitizing an image, simplified, is similar to the process shown below:

If the presence of a circle is encoded as 1 and its absence as 0, then moving the circle to a different location produces a completely different representation of the data. From a visual point of view, however, the content (essence) of the image has not changed, only the position.

So when we move the objects in an image, the parameters obtained in the traditional way differ greatly! This does not meet the requirements of image processing.
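A tiny example makes the problem concrete. The two 4×4 grids below are made-up data: the same 2×2 blob appears in the top-left corner of one and the bottom-right corner of the other, and each grid is flattened row by row as in the traditional representation.

```python
def flatten(grid):
    """Flatten a 2D grid into a single list, row by row."""
    return [value for row in grid for value in row]


top_left = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]

bottom_right = [[0, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 1, 1]]

va, vb = flatten(top_left), flatten(bottom_right)

# Positions where both vectors are 1, and positions where they differ.
overlap = sum(p * q for p, q in zip(va, vb))
differing = sum(p != q for p, q in zip(va, vb))

print(overlap, differing)  # 0 8
```

Both grids contain the same shape, yet the flattened vectors share no active position at all.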

A CNN addresses this problem by retaining the features of the image in a way that resembles vision, so that an image that has been flipped, rotated, or shifted can still be recognized as a similar image.

So how are Convolutional Neural Networks implemented? Before we get into the principles of CNNs, let’s take a look at what the principles of human vision are.

Many results in deep learning are inseparable from research on the cognitive principles of the brain, especially the principles of vision.

The 1981 Nobel Prize in Medicine was awarded to David Hubel (a Canadian-born American neurobiologist) and Torsten Wiesel, together with Roger Sperry. The main contribution of Hubel and Wiesel was “the discovery of information processing in the visual system”: they found that the visual cortex is hierarchical.

Human vision works as follows: it starts with raw signal intake (the pupil takes in pixels), then does preliminary processing (certain cells in the cortex detect edges and orientations), then abstracts (the brain determines that the object in front of it is round), and then abstracts further (the brain determines that the object is a balloon). Here is an example of how the human brain performs face recognition:

For different objects, human vision also recognizes them through such a layer-by-layer hierarchy:

We can see that the lowest-level features are basically similar, namely various edges. The higher the level, the more object-specific the extracted features become (wheels, eyes, torsos, etc.), and at the top level the different high-level features are combined into the corresponding image, enabling humans to distinguish accurately between different objects.

So it is natural to ask: can we mimic this property of the human brain by constructing a multi-layer neural network in which the lower layers recognize primary image features, several lower-layer features are combined into higher-layer features, and the combination across multiple layers finally yields a classification at the top layer?

The answer is yes, and this is the inspiration for many deep learning algorithms, including CNNs.

The typical CNN consists of 3 parts:

Convolutional layer

Pooling layer

Fully connected layer

If you describe it in simple terms:

The convolutional layer is responsible for extracting local features in the image; the pooling layer is used to drastically reduce the number of parameters (dimensionality reduction); and the fully connected layer resembles a traditional neural network and is used to output the desired result.

The following explanation of the principles leaves out many technical details to keep it easy to understand. If you are interested in the detailed principles, you can watch this video “Fundamentals of Convolutional Neural Networks”.

The operation of the convolutional layer is shown below, with a convolutional kernel sweeping the whole image:

This process can be understood as using a filter (the convolution kernel) to filter each small region of the image, obtaining a feature value for each of these regions.
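The sweeping operation can be sketched in a few lines of plain Python. This is a minimal illustration assuming no padding and a stride of 1, and, as is common in deep learning, the kernel is not flipped. The image and the vertical-edge kernel are made-up values, and `conv2d` is a hypothetical helper name.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the feature map of weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


# Made-up 4x4 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Made-up 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is large only in the column where the edge sits, which is the sense in which a kernel "detects" a local pattern.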

In practice there is usually more than one convolution kernel. Each kernel can be thought of as representing an image pattern: if convolving an image patch with a kernel yields a large value, the patch is considered to match that kernel closely. If we design 6 convolution kernels, we are in effect assuming that the image contains 6 underlying texture patterns, that is, the image can be described with 6 basic patterns. Here is an example of 25 different convolution kernels:

Summary: the convolutional layer extracts local features in the image by filtering with convolution kernels, similar to the feature extraction of human vision mentioned above.

The pooling layer is simply downsampling, and it can greatly reduce the dimensionality of the data. The process is as follows:

In the figure above, the original image is 20×20. We downsample it with a 10×10 sampling window, which yields a 2×2 feature map.

We do this because the image is still large after convolution (the convolution kernel is relatively small), so downsampling is applied to reduce the data dimension.
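The 20×20 to 2×2 example can be reproduced with a short sketch. It assumes max pooling with non-overlapping windows (the text does not say which pooling function the figure uses), and the input values are made up so that each window's maximum is easy to check by hand.

```python
def max_pool(feature_map, k):
    """Max pooling with non-overlapping k x k windows."""
    out_h = len(feature_map) // k
    out_w = len(feature_map[0]) // k
    return [[max(feature_map[i * k + u][j * k + v]
                 for u in range(k) for v in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


# Made-up 20x20 input whose value at (row, col) is row * 20 + col.
feature_map = [[r * 20 + c for c in range(20)] for r in range(20)]

pooled = max_pool(feature_map, 10)
print(pooled)  # [[189, 199], [389, 399]]
```

The 400 input values are reduced to 4, one per 10×10 window.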

Summary: the pooling layer reduces the data dimension more effectively than the convolutional layer. This not only reduces the amount of computation but also helps to avoid overfitting.

This is the last step: the data processed by the convolutional and pooling layers is fed into the fully connected layer to obtain the final result.

Only after the convolutional and pooling layers have reduced the dimensionality of the data can the fully connected layer “keep up”; otherwise the amount of data is too large, and the computation is costly and inefficient.

A typical CNN is not just the 3-layer structure described above but a multi-layer structure, such as LeNet-5, shown below:

Convolutional layer – Pooling layer – Convolutional layer – Pooling layer – Convolutional layer – Fully connected layer
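To see how such a stack shrinks the data, the sketch below traces the spatial size through the five convolutional and pooling layers listed above. The 32×32 input, 5×5 kernels without padding, and 2×2 pooling windows follow the commonly cited LeNet-5 configuration and are assumptions here, since the text does not give these sizes.

```python
def conv_out(size, k):
    """Output width of a k x k convolution with no padding and stride 1."""
    return size - k + 1


def pool_out(size, k):
    """Output width of non-overlapping k x k pooling."""
    return size // k


# Layer sequence from the text; the kernel and window sizes are assumed.
layers = [("conv", 5), ("pool", 2), ("conv", 5), ("pool", 2), ("conv", 5)]

size = 32  # assumed input width and height
sizes = []
for kind, k in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    sizes.append(size)

print(sizes)  # [28, 14, 10, 5, 1]
```

By the time the data reaches the fully connected layer, each feature map has been reduced to a single value.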

Now that we understand the fundamentals of CNNs, let’s look at their practical applications.

Convolutional Neural Networks (CNNs) are very good at processing images. Since video is a sequence of images, they are equally good at processing video content. Here is a list of some of the more mature applications:

Image classification, retrieval

Image classification is a relatively basic application. It classifies images effectively and can save a great deal of labor. For some domain-specific images, classification accuracy can reach 95% or more, which already makes it a highly usable application.

Typical scenarios: image search…

Target localization detection

A CNN can locate a target in an image and determine its position and size.

Typical scenarios: autonomous driving, security, medical …

Target segmentation

Put simply, this is classification at the pixel level.

It can distinguish foreground from background at the pixel level and, at a more advanced level, recognize and classify the target.

Typical scenarios: photo beautification apps, video post-processing, image generation …

Face Recognition

Face recognition is already a very popular application, and has a wide range of applications in many fields.

Typical scenarios: security, finance, life…

Skeletal Recognition

Skeletal recognition identifies the key bones in the body and tracks their movement.

Typical scenarios: security, movies, image video generation, games…

Today we introduced the value, basic principles and application scenarios of CNN, which are briefly summarized as follows:

The value of CNN:

Ability to effectively reduce images with a large data volume to a small data volume (without affecting the results)

Ability to retain the features of the picture, similar to the principle of human vision

Basic principle of CNN:

Convolution layer – the main role is to retain the features of the picture

Pooling layer – the main role is to reduce the dimensionality of the data, which helps to avoid overfitting

Fully-connected layer – outputs the result we want, depending on the task

CNN’s practical applications:

Picture classification, retrieval

Target Localization Detection

Target Segmentation

Face Recognition

Skeletal Recognition

This article was first published in easyAI-Artificial Intelligence Knowledgebase
