What is the hierarchical structure of a convolutional neural network?

Structure of Convolutional Neural Networks

The basic structure of a convolutional neural network consists of the following parts: input layer, convolutional layer, pooling layer, activation function layer and fully connected layer.

Convolutional Neural Networks (CNNs) are a class of feedforward neural networks that use convolution operations and have a deep structure; they are one of the representative algorithms of deep learning.

Convolutional neural networks are capable of representation learning and can classify input information in a shift-invariant way according to their hierarchical structure, so they are also known as “Shift-Invariant Artificial Neural Networks” (SIANN).

Research on convolutional neural networks began in the 1980s and 1990s; time-delay networks and LeNet-5 were among the earliest convolutional neural networks. In the twenty-first century, with the development of deep learning theory and improvements in numerical computing hardware, convolutional neural networks have developed rapidly and have been applied to computer vision, natural language processing, and other fields.

Convolutional neural networks are modeled on the visual perception mechanism of living creatures and can be used for both supervised and unsupervised learning. Sharing convolution kernel parameters within the hidden layers and connecting layers sparsely allow a convolutional neural network to learn features with a grid-like topology at a small computational cost.


Connections between layers in a convolutional neural network are called sparse connections: a neuron in a convolutional layer is connected to only some, not all, of the neurons in the adjacent layer, in contrast to the full connectivity of an ordinary feedforward neural network.

Specifically, any pixel in the feature map of layer l of a convolutional neural network is a linear combination only of the pixels within the receptive field defined by the convolution kernel in layer l-1. Sparse connections have a regularizing effect, which improves the stability and generalization of the network and helps reduce overfitting.
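The sparse-connection idea can be checked numerically. The following is a minimal NumPy sketch, not taken from the original article: the function name `conv2d_valid`, the 6×6 input, and the 3×3 kernel are all illustrative assumptions. It changes a single input pixel and lists which output pixels react.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive stride-1 convolution without padding (cross-correlation form)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output pixel is a linear combination of one kh x kw patch only.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))    # illustrative input size
kernel = rng.normal(size=(3, 3))   # illustrative kernel size
before = conv2d_valid(image, kernel)

# Perturb the bottom-right input pixel and see which outputs change.
perturbed = image.copy()
perturbed[5, 5] += 10.0
after = conv2d_valid(perturbed, kernel)
changed = np.argwhere(~np.isclose(before, after))
print(changed)  # only the output whose 3x3 field covers input pixel (5, 5)

# Parameter sharing: one 3x3 kernel versus a dense 36 -> 16 layer.
shared_params = kernel.size
dense_params = image.size * before.size
print(shared_params, dense_params)
```

Only one of the 16 output pixels depends on the perturbed input, and the whole layer uses 9 weights where a fully connected layer between the same two grids would need 576.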

Convolutional Neural Networks in Plain Terms

In plain terms, convolutional neural networks can be understood as follows:

Convolutional Neural Networks (CNNs) - Structure

① A CNN generally contains these layers:

Input Layer: used for the input of the data

Convolutional Layer: uses convolution kernels for feature extraction and feature mapping

Activation Layer (also called the excitation layer): convolution is a linear operation, so a nonlinear mapping needs to be added

Pooling Layer: downsamples the feature map, making it sparser and reducing the amount of computation

Fully Connected Layer: usually placed at the tail of the CNN to re-fit the features and reduce the loss of feature information

Output Layer: Used for outputting the result
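To make the order of these layers concrete, here is a minimal single-channel forward pass in NumPy. It is a sketch under assumptions of my own choosing (one 5×5 kernel, 2×2 max pooling, 10 output classes, random untrained weights), not a description of any particular network.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(x, k):
    """Stride-1 convolution without padding."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """2x2 max pooling with stride 2."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=(28, 28))                      # input layer
feat = conv2d_valid(x, rng.normal(size=(5, 5)))    # convolutional layer
act = relu(feat)                                   # activation layer
pooled = max_pool2x2(act)                          # pooling layer
W = rng.normal(size=(10, pooled.size)) * 0.01      # fully connected weights
probs = softmax(W @ pooled.ravel())                # output layer
print(feat.shape, pooled.shape, probs.shape)       # (24, 24) (12, 12) (10,)
```

A real CNN has many kernels per layer and learns its weights by backpropagation; the sketch only shows how data flows through the layer types listed above.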

② Some other functional layers can also be used in between:

Normalization Layer (Batch Normalization): normalizes the features inside the CNN

Slice Layer: splits certain (image) data into regions that are learned separately

Fusion Layer: merges branches that learned features independently


Convolutional Neural Networks (CNNs) - Input Layer

① The input layer of a CNN preserves the structure of the image itself.

② For a black-and-white 28×28 picture, the input to the CNN is a 28×28 two-dimensional array of neurons.

③ For a 28×28 picture in RGB format, the input to the CNN is a 3×28×28 three-dimensional array of neurons (each RGB color channel has its own 28×28 matrix).

Convolutional Neural Networks (CNNs) - Convolutional Layer

Local Receptive Fields

① The convolutional layer involves several important concepts:

② Assume the input is a 28×28 two-dimensional array of neurons and we define 5×5 local receptive fields, i.e., each neuron in the hidden layer is connected to a 5×5 patch of neurons in the input layer. This 5×5 region is called the local receptive field.

Lecture 9: Convolutional Neural Network Architectures

First, review LeNet-5, which was very successful at digit recognition. Its network structure is [CONV-POOL-CONV-POOL-FC-FC]. The convolutional layers use 5×5 kernels with a stride of 1; the pooling layers use 2×2 regions with a stride of 2; these are followed by fully connected layers. This is shown in the following figure:
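As a rough check on these settings, the spatial size after each CONV and POOL stage can be traced with the usual output-size formula (n + 2·pad − k) / stride + 1. The sketch below assumes a 32×32 input and no padding, which is the classic LeNet-5 setup but is not stated in the text above; the helper name `out_size` is illustrative.

```python
def out_size(n, k, stride, pad=0):
    """Spatial output size of a conv or pooling layer."""
    return (n + 2 * pad - k) // stride + 1

# (layer type, kernel/region size, stride) as described in the text
layers = [("CONV", 5, 1), ("POOL", 2, 2), ("CONV", 5, 1), ("POOL", 2, 2)]

size = 32  # assumption: 32x32 input image
trace = [size]
for name, k, stride in layers:
    size = out_size(size, k, stride)
    trace.append(size)

print(trace)  # [32, 28, 14, 10, 5]
```

Under these assumptions the feature maps entering the fully connected layers are 5×5.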

AlexNet, in 2012, was the first large CNN to win the ImageNet competition. Its structure is very similar to LeNet-5, only with more layers: [CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8], i.e., five convolutional layers, three pooling layers, two normalization layers, and three fully connected layers. It is shown below:

The network is drawn in two halves, top and bottom, because GPU memory at the time was so small that the model had to be split across two GPUs. Some more details are:

AlexNet improved accuracy by almost 10 percentage points when it won the 2012 ImageNet competition. The 2013 winner, ZFNet, used the same network architecture as AlexNet, only with further tuning of the hyperparameters:

This reduced the error rate from 16.4% to 11.7%.

GoogLeNet and VGG, the winner and runner-up in 2014, have 22 and 19 layers, respectively; each is described below.

Compared with AlexNet, VGG uses smaller convolution kernels and more layers. VGG comes in 16-layer and 19-layer versions. Its convolutions use only 3×3 kernels with a stride of 1 and padding of 1; its pooling regions are 2×2 with a stride of 2.

So why use a small 3×3 convolution kernel?
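One commonly cited argument, given here as a sketch rather than as this article's own answer, is that a stack of 3×3 convolutions sees the same input region as one larger kernel while using fewer weights and inserting more nonlinearities. The helper names and the channel count C = 64 below are illustrative assumptions.

```python
def stacked_receptive_field(num_layers, k=3):
    """Effective receptive field of num_layers stacked k x k convs, stride 1."""
    return 1 + num_layers * (k - 1)

def conv_weights(k, channels):
    """Weights (biases ignored) of one k x k conv with `channels` in and out."""
    return k * k * channels * channels

C = 64  # illustrative channel count

field = stacked_receptive_field(3)
three_small = 3 * conv_weights(3, C)   # three stacked 3x3 layers: 27 * C^2
one_large = conv_weights(7, C)         # one 7x7 layer: 49 * C^2

print(field)        # 7
print(three_small)  # 110592
print(one_large)    # 200704
```

Three stacked 3×3 layers cover a 7×7 region with roughly 55% of the weights of a single 7×7 layer.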

Here’s a look at the parameters and memory usage of VGG-16:
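The parameter count can be reproduced with a short calculation. This sketch assumes the standard VGG-16 configuration (13 convolutional layers with 3×3 kernels, a 224×224×3 input, two 4096-unit fully connected layers, and 1000 output classes), none of which is spelled out in the text above; `cfg` and `vgg16_params` are illustrative names.

```python
# Conv channel widths in order; "M" marks a 2x2 max-pooling layer.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

def vgg16_params(in_ch=3, size=224, fc=(4096, 4096, 1000)):
    total = 0
    for v in cfg:
        if v == "M":
            size //= 2                        # pooling halves height and width
        else:
            total += 3 * 3 * in_ch * v + v    # 3x3 weights plus biases
            in_ch = v
    features = in_ch * size * size            # flattened input to the first FC
    for width in fc:
        total += features * width + width     # FC weights plus biases
        features = width
    return total

total_params = vgg16_params()
print(total_params)  # 138357544
```

Under these assumptions the total is about 138 million parameters, and the first fully connected layer alone accounts for more than 100 million of them.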

Some of the details of the VGG network are:

Next, look at the winner of the classification task, GoogLeNet.

First, some of the details of GoogLeNet:

The “Inception” module is a carefully designed local network topology (a network within a network); these modules are then stacked on top of each other.

This topology applies a number of different filtering operations, such as 1×1 convolution, 3×3 convolution, 5×5 convolution, and 3×3 pooling, in parallel to the input from the previous layer. The outputs of all the filters are then concatenated together in depth. This is shown below:

But one problem with this structure is that the computational complexity is greatly increased. Consider, for example, the following network setup:

The input is 28x28x256 and the output of the concatenation is 28x28x672 (assuming each filtering operation preserves the spatial size through zero-padding). The computational cost is also very high:

Because the pooling operation preserves the depth of its input, the depth of the module’s output can only grow. The solution is to add a “bottleneck layer” before the convolution operations: a 1×1 convolution that reduces the depth while preserving the spatial size of the input, provided the number of 1×1 kernels is smaller than the depth of the input.
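The cost can be estimated by counting multiply-accumulate operations. The filter counts below (128, 192, and 96) are assumptions; they are chosen because, together with the 256 channels passed through by pooling, they add up to the 672 output channels quoted above.

```python
H, W, C_IN = 28, 28, 256  # input size from the text

# (kernel size, number of filters); the filter counts are assumed
branches = [(1, 128), (3, 192), (5, 96)]

def conv_ops(k, c_in, c_out, h=H, w=W):
    """Multiply-accumulates for a k x k conv that keeps the h x w size."""
    return h * w * c_out * k * k * c_in

# 3x3 pooling keeps all 256 input channels, so it adds C_IN to the depth.
out_depth = sum(filters for _, filters in branches) + C_IN
total_ops = sum(conv_ops(k, C_IN, filters) for k, filters in branches)

print(out_depth)  # 672
print(total_ops)  # 854196224
```

With these settings a single module needs roughly 854 million multiply-accumulates, most of them in the 3×3 and 5×5 branches.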

Using this structure, with the same network parameters, does reduce the amount of computation:

The final output is 28x28x480. The total amount of computation at this point is:
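The saving can be estimated the same way, by counting multiply-accumulate operations. In this sketch the branch widths (128, 192, 96) and the 64-filter 1×1 bottlenecks are assumptions, chosen so that the concatenated depth matches the 480 quoted above.

```python
H, W, C_IN, REDUCE = 28, 28, 256, 64  # REDUCE: assumed 1x1 bottleneck width

def conv_ops(k, c_in, c_out):
    """Multiply-accumulates for a k x k conv that keeps the 28x28 size."""
    return H * W * c_out * k * k * c_in

ops = [
    conv_ops(1, C_IN, 128),     # plain 1x1 branch
    conv_ops(1, C_IN, REDUCE),  # 1x1 reduction before the 3x3 conv
    conv_ops(3, REDUCE, 192),   # 3x3 conv on the reduced depth
    conv_ops(1, C_IN, REDUCE),  # 1x1 reduction before the 5x5 conv
    conv_ops(5, REDUCE, 96),    # 5x5 conv on the reduced depth
    conv_ops(1, C_IN, REDUCE),  # 1x1 projection after 3x3 pooling
]

out_depth = 128 + 192 + 96 + REDUCE
total_ops = sum(ops)

print(out_depth)  # 480
print(total_ops)  # 271351808
```

Under these assumptions the module needs about 271 million multiply-accumulates, compared with about 854 million for the same branch widths without the 1×1 reductions.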

The Inception modules are stacked vertically; for ease of display, the model is shown here laid out horizontally:

The total number of layers with parameters is therefore 3+18+1=22. The layers in the orange sections are not counted in this total; both sections have the following structure: AveragePool5x5+3(V)-Conv1x1+1(S)-FC-FC-SoftmaxActivation-Output.

“The strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the third and sixth Inception modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.” (quote from the original paper)

Starting in 2015, the number of layers in the network exploded, with the ’15-’17 winners having 152 layers, beginning the “depth revolution.”

ResNet is a very deep network that uses residual connections. Here are the details:

Does ResNet perform so well just because it is deep? The answer is no. Studies have shown that a plain 56-layer network built by simply stacking convolutional layers has larger training and test errors than a 20-layer one. Overfitting is not the cause; rather, deeper networks are harder to optimize. In principle a deeper model should perform at least as well as a shallower one: a deeper model can be constructed by copying the layers of the shallower model and setting the additional layers to identity mappings. Residual connections are designed to make such a solution easy for the deeper model to learn.

Instead of using parameterized layers to learn the underlying mapping between inputs and outputs directly, as ordinary CNNs (e.g., AlexNet, VGG) do, ResNet uses several parameterized layers to learn the residual mapping between the input and the output.

Let the input be X and let the mapping of a parameterized network layer be H, so that the output of the layer for input X is H(X). An ordinary CNN learns the parametric function H directly through training, and thus directly obtains the mapping from X to H(X). Residual learning instead uses several parameterized layers to learn the mapping from the input to the residual H(X)-X between output and input, i.e., X->(H(X)-X), and then adds X back through an identity mapping. The output of the network is still H(X)-X+X=H(X); the layers only learn H(X)-X, while the X part is passed through unchanged.
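A tiny numerical sketch may help here. It uses two small dense layers in place of convolutions and omits any activation after the addition; both are simplifying assumptions, and `residual_unit`, `w1`, and `w2` are illustrative names.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2):
    """Compute (H(X) - X) + X, where the layers only model the residual."""
    residual = w2 @ relu(w1 @ x)  # the part the parameterized layers learn
    return residual + x           # the identity mapping adds X back

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))

# With all weights at zero the residual is zero and the unit is the identity.
y = residual_unit(x, zeros, zeros)
print(y)  # same values as x
```

This is why extra residual units cannot easily make a network worse: driving the weights toward zero recovers the identity mapping, which a plain stack of layers would have to learn explicitly.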

By introducing the identity mapping, the residual learning unit establishes a direct channel between input and output, allowing the parameterized layers to concentrate on learning the residual between them. We generally write F(X, W_i) for the residual mapping, so the output of the residual learning unit is Y = F(X, W_i) + X. When the input and output have the same number of channels, X can be added directly. When the numbers of channels differ, an effective mapping must be built so that the processed input X and the output Y have the same number of channels.

When X and Y have different numbers of channels, there are two ways to build this mapping. One is simply to pad X with zeros for the channels it lacks relative to Y so that the two align; the other is to apply a 1×1 convolution, written W_s, so that the input and output end up with the same number of channels, giving Y = F(X, W_i) + W_s X.
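Both options can be sketched in NumPy on a tensor laid out as (channels, height, width). The function names, the 4-to-6 channel change, and the random projection weights are illustrative assumptions, and the sketch ignores any change in spatial size that may accompany a change in channel count.

```python
import numpy as np

def zero_pad_shortcut(x, c_out):
    """Option 1: append all-zero channels so x has c_out channels."""
    c_in = x.shape[0]
    return np.pad(x, ((0, c_out - c_in), (0, 0), (0, 0)))

def projection_shortcut(x, w_s):
    """Option 2: a 1x1 convolution; w_s (c_out, c_in) mixes channels per pixel."""
    return np.einsum("oc,chw->ohw", w_s, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))      # 4 input channels
w_s = rng.normal(size=(6, 4))       # hypothetical projection weights

padded = zero_pad_shortcut(x, 6)
projected = projection_shortcut(x, w_s)
print(padded.shape, projected.shape)  # (6, 8, 8) (6, 8, 8)
```

Zero-padding adds no parameters, while the 1×1 projection adds c_out × c_in learnable weights; either result can then be summed with the output of the residual branch.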

The complete network structure is as follows:

For ResNet-50 and deeper networks, a GoogLeNet-like “bottleneck layer” is used for computational efficiency. As in the Inception module, 1×1 convolutions reduce and then restore the feature map depth, so that the number of kernels in the 3×3 convolution is not tied to the input from the previous layer and its output does not affect the next layer. This design is intended purely to save computation and thus shorten the time needed to train the model; it has no impact on the final model accuracy.
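The saving can be illustrated by counting weights. The channel widths here (256 in and out, 64 inside the bottleneck) are assumptions for the sake of the example, and biases are ignored.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

# Bottleneck block: 1x1 reduce, 3x3 on the narrow features, 1x1 expand.
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))

# For comparison: two 3x3 convolutions at the full 256-channel width.
plain = 2 * conv_weights(3, 256, 256)

print(bottleneck)  # 69632
print(plain)       # 1179648
```

With these widths the bottleneck block uses about 6% of the weights of the two full-width 3×3 layers, and the computation per pixel shrinks in the same proportion.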

The actual training of ResNet is as follows:

In practice, many layers can be stacked without loss of accuracy: 152 on ImageNet, 1202 on CIFAR. Now, as expected, the deeper the network, the higher the training accuracy. ResNet swept all the 2015 competition awards and exceeded human recognition rates for the first time.

The left graph below compares the networks by top-1 accuracy; the right graph shows their computational complexity, with the horizontal axis giving the amount of computation and the size of each circle indicating the memory footprint. Inception-v4 here is ResNet + Inception.

The graph shows:

Forward propagation time and power consumption can also be compared: