Convolutional neural network framework structure analysis diagram

An illustration of the principle of CNN

The principle of a CNN is to use convolution and an activation function for feature extraction, which yields a new type of neural network model.

Extended Knowledge:

A convolutional neural network (CNN) is a deep learning model specialized in processing data with a grid-like structure, such as images and speech signals. Through the principles of local perception, shared weights, and step-by-step computation, a CNN can effectively reduce the number of model parameters and improve the model’s generalization ability and computational efficiency.

In CNN, each neuron is connected to only a local region of the input data, i.e., each neuron receives only a small part of the input data. This localized perception allows the CNN to better capture the local features of the input data and reduces the number of parameters, decreasing the complexity of the model.

In CNNs, the same convolution kernel (also known as a filter or weight matrix) can be reused multiple times to perform convolution operations on the input data. This strategy of sharing weights not only reduces the number of parameters in the model, but also enhances the spatial invariance of the model to the input data.

The step-by-step computation of CNN consists of two steps: convolutional computation and activation function. In convolutional computation, each neuron multiplies the input data point-by-point with the convolutional kernel and then adds up the results as the neuron’s output; in the activation function, the neuron’s output is passed through a nonlinear function (e.g., ReLU) to increase the expressive power of the model.
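The two steps described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `conv2d_valid` and `relu`, the 4×4 input, and the 2×2 kernel are all made up for the example, and the "convolution" is the sliding multiply-and-sum used in CNNs (stride 1, no padding, kernel not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding.

    At each position the overlapping values are multiplied point by point
    and summed, which is the convolutional computation described above.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Apply the ReLU nonlinearity max(0, v) to every element."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hypothetical 4x4 input and 2x2 kernel, chosen only for illustration.
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]

pre_activation = conv2d_valid(image, kernel)  # 3x3 feature map
activated = relu(pre_activation)              # negative responses clipped to 0
```

The same kernel is reused at every position, which is the weight sharing described earlier: the layer has only four weights regardless of the input size.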

Pooling is an important technique in CNNs that reduces the dimensionality of the input data while retaining important features. Common pooling operations include max pooling, average pooling, and L2-norm pooling. Pooling can enhance the robustness of the model and reduce the risk of overfitting.
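As a rough sketch of max and average pooling, the following plain-Python function reduces each non-overlapping 2×2 window to a single value. The function name `pool2d`, its parameters, and the sample feature map are assumptions made for this example.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce each size x size window of `feature_map` to one value.

    mode="max" keeps the largest value in the window (max pooling);
    any other mode returns the window mean (average pooling).
    """
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map; pooling halves its height and width.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
```

With a 2×2 window and a stride of 2, a 4×4 map becomes 2×2, so the next layer sees a quarter as many values.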

CNNs usually consist of multiple stages, each made up of alternating convolutional and pooling layers. These stages enable a CNN to gradually extract higher-level features. For example, in image recognition tasks, the bottom layers of a CNN can extract basic features such as edges and textures, and the higher layers can extract advanced features such as the shape and position of objects.

CNN training uses the backpropagation algorithm to update the weights of the model. Each training step can be divided into two stages: forward propagation and backpropagation. In forward propagation, the input data passes through the convolutional and pooling layers of the CNN to produce the output; in backpropagation, the error gradient of each neuron is calculated from the difference between the output and the true label, and the weights of each neuron are then updated.

In short, CNN, as a powerful deep learning model, has been widely used in image processing, natural language processing, speech recognition and other fields. It can effectively reduce the complexity of the model and improve the generalization ability and computational efficiency of the model through the principles of local perception, shared weights, step-by-step computation, pooling operation and multi-level structure.

Lecture 9: Convolutional Neural Network Architectures

First, a review of LeNet-5, which had great success in digit recognition, with the network structure [CONV-POOL-CONV-POOL-FC-FC]. The convolutional layers use 5×5 convolution kernels with a stride of 1; the pooling layers use 2×2 regions with a stride of 2; and these are followed by fully connected layers. This is shown in the following figure:

AlexNet, which in 2012 became the first large CNN to win the ImageNet competition, has a structure very similar to LeNet-5, only with more layers: [CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8]. There are five convolutional layers, three pooling layers, two normalization layers, and three fully connected layers. It is shown below:

The network is split into two parts, top and bottom, because GPU memory at that time was so small that the model had to be spread across two GPUs. Some more details are:

AlexNet improved accuracy by almost 10% when it won the ImageNet competition in 2012. The winner in 2013 was ZFNet, which used the same network architecture as AlexNet, only with further tuning of the hyperparameters:

This reduced the error rate from 16.4% to 11.7%.

GoogLeNet and VGG, the winner and runner-up in 2014, have 22 and 19 layers, respectively; each is described below.

Compared with AlexNet, VGG uses smaller convolution kernels and more layers. VGG comes in 16-layer and 19-layer versions. The convolution kernels are all 3×3 with a stride of 1 and a padding of 1; the pooling regions are 2×2 with a stride of 2.

So why use a small 3×3 convolution kernel?
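One common way to answer this question is to compare a stack of three 3×3 layers with a single 7×7 layer: both see the same 7×7 region of the input, but the stack needs fewer weights and applies more nonlinearities. The sketch below checks the arithmetic; the helper names and the channel count of 256 are assumptions for illustration, and biases are ignored.

```python
def stacked_receptive_field(num_layers, kernel_size):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(num_layers, kernel_size, channels):
    """Weights in a stack where every layer maps `channels` to `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 256  # hypothetical channel count, for illustration only

three_3x3 = conv_weights(3, 3, channels)  # 3 * 9 * C^2 = 27 * C^2
one_7x7 = conv_weights(1, 7, channels)    # 49 * C^2

print(stacked_receptive_field(3, 3), stacked_receptive_field(1, 7))
print(three_3x3, one_7x7)
```

For any channel count C the ratio is 27·C² to 49·C², so the stacked 3×3 layers cover the same receptive field with roughly 45% fewer weights.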

Here’s a look at the parameters and memory usage of VGG-16:

Some of the details of the VGG network are:

Next, a look at the first-place network in classification, GoogLeNet.

First, some of the details of GoogLeNet:

The “Inception” module is a well-designed local network topology (a network within a network); these modules are then stacked on top of each other.

This topology applies a number of different filtering operations, such as 1×1 convolution, 3×3 convolution, 5×5 convolution, and 3×3 pooling, in parallel to the input from the previous layer. The outputs of all the filters are then concatenated together in depth. This is shown below:

But one problem with this structure is that the computational complexity is greatly increased. Consider, for example, the following network setup:

The inputs are 28x28x256 and the outputs of the concatenation are 28x28x672. (Assuming that each filtering operation maintains the input size by zero-padding.) And the computational expense is also very high:

Because the pooling operation maintains the depth of the original inputs, the network’s outputs are bound to increase in depth. The solution is to add a “bottleneck layer” before the convolution operation, which uses a 1×1 convolution to reduce the depth while preserving the size of the original input space, as long as the number of convolution kernels is less than the depth of the original input.

Using this structure, with the same network parameters, does reduce the amount of computation:

The final output is 28x28x480. The total amount of computation at this point is:
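The effect of the bottleneck layer can be checked with a small multiply count. The text gives only the input size (28x28x256) and the two output depths (672 and 480); the individual filter counts below (128, 192, 96, and 64-filter 1×1 reductions) are assumptions chosen so that the depths add up to those totals, and the function name `conv_ops` is made up for this sketch.

```python
def conv_ops(height, width, num_filters, kernel_size, in_depth):
    """Multiplications for one convolution that preserves spatial size."""
    return height * width * num_filters * kernel_size * kernel_size * in_depth


H, W, DEPTH_IN = 28, 28, 256

# Naive module (assumed filter counts): 1x1, 3x3, and 5x5 convolutions
# applied directly to the input, plus a 3x3 pooling branch that keeps
# all 256 input channels.
naive_ops = (
    conv_ops(H, W, 128, 1, DEPTH_IN)
    + conv_ops(H, W, 192, 3, DEPTH_IN)
    + conv_ops(H, W, 96, 5, DEPTH_IN)
)
naive_depth = 128 + 192 + 96 + DEPTH_IN

# Bottleneck module (assumed): 64-filter 1x1 convolutions reduce the depth
# before the 3x3 and 5x5 convolutions and after the pooling branch.
REDUCED = 64
bottleneck_ops = (
    conv_ops(H, W, 128, 1, DEPTH_IN)         # plain 1x1 branch
    + conv_ops(H, W, REDUCED, 1, DEPTH_IN)   # 1x1 reduction before 3x3
    + conv_ops(H, W, 192, 3, REDUCED)
    + conv_ops(H, W, REDUCED, 1, DEPTH_IN)   # 1x1 reduction before 5x5
    + conv_ops(H, W, 96, 5, REDUCED)
    + conv_ops(H, W, REDUCED, 1, DEPTH_IN)   # 1x1 projection after pooling
)
bottleneck_depth = 128 + 192 + 96 + REDUCED

print(naive_depth, naive_ops)
print(bottleneck_depth, bottleneck_ops)
```

Under these assumed filter counts, the expensive 3×3 and 5×5 convolutions operate on 64 channels instead of 256, which is where most of the savings come from.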

The Inception modules are stacked vertically; for ease of description, the model is laid out horizontally here:

The total number of parameterized layers is therefore 3+18+1 = 22. The layers in the orange sections (the auxiliary classifiers) are not counted in this total; both have the following structure: AveragePool5x5+3(V)-Conv1x1+1(S)-FC-FC-SoftmaxActivation-Output. The motivation, paraphrased from the original paper, is as follows. The strong performance of relatively shallow networks on this classification task suggests that the features produced by the middle layers of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, the authors expect to encourage discrimination in the lower stages of the classifier, increase the gradient signal that is propagated back, and provide additional regularization. These auxiliary classifiers take the form of smaller convolutional networks placed on top of the output of the third and sixth Inception modules. During training, their loss is added to the total loss of the network with a discount weight (the auxiliary losses are weighted by 0.3). At prediction time, these auxiliary networks are discarded.

Starting in 2015, the number of layers in the network exploded, with the ’15-’17 winners having 152 layers, beginning the “depth revolution.”

ResNet is a very deep network that uses residual connections. Here are the details:

Is ResNet performing so well just because it’s deep? The answer is no. Studies have shown that a plain 56-layer network built by simply stacking convolutional layers has larger training and test errors than a 20-layer network. Overfitting is not the cause, since the training error is also higher; rather, deeper networks are harder to optimize. Yet a deeper model should be able to perform at least as well as a shallower one by construction: copy the layers of the shallower model into the deeper one, and set the additional layers to identity mappings. This observation motivates a design in which such identity mappings are easy for the deeper model to learn.

ResNet uses multiple parameterized layers to learn the residual mapping between the input and the output, instead of using them to learn the underlying mapping from input to output directly, as is done in general CNNs (e.g., AlexNet, VGG).

If the input is X and the mapping of a parameterized network layer is H, then the output of that layer for input X is H(X). An ordinary CNN learns the parametric function H directly through training, and so directly obtains the mapping from X to H(X). Residual learning instead uses multiple parameterized layers to learn the mapping from the input to the residual H(X)-X between the output and the input, i.e., X->(H(X)-X), and then adds the identity mapping of X. The output of the network is still H(X)-X+X=H(X); only the part (H(X)-X) is learned, while X is passed through by the identity mapping.

The residual learning unit establishes a direct channel between input and output by introducing the identity mapping, which allows the parameterized layers to concentrate on learning the residual between input and output. We generally write F(X) = H(X)-X for the residual mapping, so the output of the residual learning unit is Y = F(X) + X. When the input and output have the same number of channels, X can be added directly. When the numbers differ, an effective shortcut mapping is needed so that the processed input X and the output Y have the same number of channels.

When X and Y have different numbers of channels, there are two ways to build the shortcut. One is to pad X with zeros in the channels it lacks relative to Y so that the two align; the other is to use a 1×1 convolution to represent a projection Ws, so that Y = F(X) + Ws·X and the input and output channel counts match.
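The residual unit and its two shortcut options can be sketched on a single spatial position, where the input is just a list of channel values and a 1×1 convolution reduces to a matrix multiplication across channels. All names below (`residual_unit`, `zero_pad_shortcut`, `projection_shortcut`) and the toy residual functions are assumptions made for illustration; a real residual branch would consist of trained convolutional layers.

```python
def residual_unit(x, residual_fn, shortcut_fn=None):
    """Return F(x) + shortcut(x); the shortcut defaults to the identity."""
    shortcut = list(x) if shortcut_fn is None else shortcut_fn(x)
    fx = residual_fn(x)
    if len(fx) != len(shortcut):
        raise ValueError("residual branch and shortcut must have equal channels")
    return [f + s for f, s in zip(fx, shortcut)]


def zero_pad_shortcut(x, out_channels):
    """Option 1: pad the missing channels of x with zeros."""
    return list(x) + [0.0] * (out_channels - len(x))


def projection_shortcut(x, ws):
    """Option 2: a 1x1 convolution, i.e. a matrix `ws` applied across channels."""
    return [sum(w * v for w, v in zip(row, x)) for row in ws]


x = [1.0, 2.0]

# Same channel count: if the residual branch outputs zeros, the unit
# reduces to the identity mapping.
identity_out = residual_unit(x, lambda v: [0.0 for _ in v])

# Hypothetical residual branch that changes 2 channels into 3.
def widen(v):
    return [0.5, 0.5, 0.25]

padded_out = residual_unit(x, widen, lambda v: zero_pad_shortcut(v, 3))

ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3x2 projection
projected_out = residual_unit(x, widen, lambda v: projection_shortcut(v, ws))
```

The first call illustrates why the identity mapping is easy to represent: the parameterized branch only has to output zeros.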

The complete network structure is as follows:

For the ResNet-50+ network, a GoogLeNet-like “bottleneck layer” is used for computational efficiency. Like the Inception module, the feature map dimension is subtly reduced or expanded by using 1×1 convolution so that the number of kernels in the 3×3 convolution is not affected by the inputs of the previous layer, and its output does not affect the next layer. However, it is designed purely to save computation time and thus reduce the time required to train the entire model, and has no impact on the final model accuracy.

The actual training of ResNet is as follows:

The actual training result is that many layers can be stacked without loss of accuracy: 152 layers on ImageNet and 1202 on CIFAR. As expected, the deeper the network, the higher the training accuracy. ResNet swept all the 2015 competitions and exceeded human-level recognition accuracy for the first time.

The left graph below compares the Top-1 accuracy of various networks; the right graph shows the computational complexity of the different networks, with the horizontal axis being the amount of computation and the size of each circle indicating the memory footprint. Inception-v4 is a combination of ResNet and Inception.

The graph shows:

Forward propagation time and power consumption can also be compared:

How to draw a neural network diagram: drawing a convolutional neural network diagram with Visio. The target graphic is similar to the figure below.

Drawing a neural network diagram with Visio

Open Visio, select the “Network” category, and choose the type of network diagram to draw; here, choose “Basic Network Diagram”.

How to use Visio to draw a simple network diagram

Take a moment to get a rough idea of what the software’s functions are.

Following the prompts, first draw a router and a switch.

Then add a PC.

Click the “Connector” tool.

Hold the mouse over a point marked with an “x” and its color automatically changes to red. After connecting all three devices, a small, simple network diagram is complete!

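How to use MATLAB to produce a neural network structure diagram

Here is an example; hopefully it gives some understanding of how a neural network application is implemented.

% x, y are the input and target vectors, respectively (define them before running)

% Create a feedforward network
net = newff(minmax(x), [20, 1], {'tansig', 'purelin'});

% Simulate the untrained network net and plot its output
y1 = sim(net, x);
plot(x, y, x, y1, ':');

% Use the L-M (Levenberg-Marquardt) optimization algorithm
net.trainFcn = 'trainlm';

% Set the training parameters
net.trainParam.epochs = 500;
net.trainParam.goal = 10^(-6);

% Call the corresponding algorithm to train the BP network
net = train(net, x, y);

% Simulate the trained BP network
y2 = sim(net, x);

% Calculate the simulation error
err = y2 - y;
res = norm(err);

% Plot the matching result curve
plot(x, y, x, y2, 'r-');

Result of running the code: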

How to draw a convolutional neural network diagram with Visio (a graphic similar to the figure below)

I tried drawing this figure with Visio. Apart from the distorted picture on the far left, the rest can basically be achieved (that picture can be generated with other image-processing software such as Photoshop and then inserted into Visio). The main shape can be found in Visio under “More Shapes”, “General”: the block shape with a perspective effect. Drag it into the drawing area and move the small red dot that adjusts the perspective angle until it looks right. The remaining blocks can be copied by holding Ctrl and dragging with the left mouse button; then carefully adjust their size and position. A rough example of the resulting graphic is shown below:

An article on four basic neural network architectures

When you are just getting started with neural networks, the many architectures can be confusing. Neural networks seem complex and diverse, but their architectures fall into just three types: feed-forward neural networks, recurrent networks, and symmetrically connected networks. This article introduces four common neural networks, namely CNN, RNN, DBN, and GAN, to give you a basic understanding of neural networks.

A neural network is a model in machine learning, an algorithmic mathematical model that mimics the behavioral characteristics of animal neural networks for distributed parallel information processing. This type of network relies on the complexity of the system to process information by adjusting the relationship between the large number of nodes interconnected within it.

In general, the architecture of neural networks can be divided into three categories:

Feed-forward neural networks:

This is the most common type of neural network used in practical applications. The first layer is the input and the last layer is the output. If there are multiple hidden layers, we call them “deep” neural networks. They compute a series of transformations that change the similarity of the samples. The activity of the neurons in each layer is a nonlinear function of the activity of the previous layer.

Recurrent networks:

Recurrent networks have directed cycles in their connection graphs, which means you can follow the arrows and get back to where you started. They can have complex dynamics that make them hard to train. They are more biologically realistic.

Recurrent networks are intended for processing sequential data. In a traditional neural network model, data flows from the input layer to the hidden layer to the output layer; adjacent layers are fully connected, while the nodes within each layer are not connected to each other. Such an ordinary network is inadequate for many problems. For example, to predict the next word in a sentence you generally need the previous words, because the words in a sentence are not independent of each other.

In a recurrent neural network, the current output of a sequence also depends on the previous outputs. The network remembers the previous information and applies it to the computation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input to the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.

Symmetric Connected Networks:

The previous post already talked a bit about perceptrons, so here is a recap.

First of all, it’s still this picture

This is an M-P neuron

A neuron has n inputs, each with a corresponding weight w. Inside the neuron, each input is multiplied by its weight and the products are summed; the bias (threshold) is then subtracted from this sum, and the result is passed to an activation function, which gives the final output. The output is often binary, with 0 representing inhibition and 1 representing activation.
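The computation of an M-P neuron can be sketched in a few lines. The function name `mp_neuron` and the specific weights and thresholds for the AND and OR examples are assumptions picked for illustration; a step function is used as the activation so that the output is binary.

```python
def mp_neuron(inputs, weights, threshold):
    """Weighted sum of the inputs, minus the threshold, through a step function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum - threshold >= 0 else 0


# Hypothetical parameters: with weights (1, 1), a threshold of 1.5 gives
# logical AND and a threshold of 0.5 gives logical OR.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [mp_neuron(p, (1, 1), 1.5) for p in pairs]
or_outputs = [mp_neuron(p, (1, 1), 0.5) for p in pairs]
```

Changing only the threshold moves the decision boundary, which is why the same two weights can represent both functions.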

The perceptron can be thought of as a hyperplane decision surface in an n-dimensional instance space: it outputs 1 for samples on one side of the hyperplane and 0 for samples on the other side. The equation of this decision hyperplane is w⋅x=0. A set of positive and negative samples that can be separated by a hyperplane is called linearly separable, and such a set can be represented by the perceptron in the figure.

AND, OR, and NOT are linearly separable problems that can easily be represented by a perceptron with two inputs, whereas XOR is not linearly separable, so a single-layer perceptron does not work for it; a multilayer perceptron is needed to solve the XOR problem.

What should we do if we want to train a perceptron?

We start with random weights and repeatedly apply the perceptron to each training sample, modifying its weights whenever it misclassifies a sample. This process is repeated until the perceptron classifies all samples correctly. Each step modifies the weight wi associated with input xi according to the perceptron training rule: wi ← wi + Δwi, where Δwi = η(t − o)xi.

Here t is the target output of the current training sample, o is the output of the perceptron, and η is a positive constant known as the learning rate. The learning rate serves to moderate the extent to which the weights are adjusted at each step; it is usually set to a small value (e.g., 0.1) and is sometimes made to decay as the number of times the weights are adjusted increases.
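The training procedure can be sketched as follows, assuming the standard perceptron rule in which each weight changes by η(t − o)·xi. The function names, the zero initial weights (used instead of random ones so that the run is reproducible), and the epoch limit are assumptions made for this example; the bias is handled as an extra weight on a constant input of 1.

```python
def predict(weights, inputs):
    """Perceptron output: 1 if w . x > 0, else 0 (inputs[0] is the constant 1)."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train_perceptron(samples, learning_rate=0.1, max_epochs=100):
    """Apply the update w_i <- w_i + eta * (t - o) * x_i until no sample is misclassified.

    Returns (weights, converged). Weights start at zero here for
    reproducibility; the text describes starting from random weights.
    """
    weights = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for features, target in samples:
            inputs = [1.0] + list(features)  # prepend the bias input
            output = predict(weights, inputs)
            if output != target:
                mistakes += 1
                for i, x in enumerate(inputs):
                    weights[i] += learning_rate * (target - output) * x
        if mistakes == 0:
            return weights, True
    return weights, False


and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_converged = train_perceptron(and_samples)
and_predictions = [predict(and_weights, [1.0] + list(f)) for f, _ in and_samples]

xor_weights, xor_converged = train_perceptron(xor_samples)
```

On the linearly separable AND data the loop stops once every sample is classified correctly, while on XOR it runs until the epoch limit, which matches the earlier point that a single-layer perceptron cannot represent XOR.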

A multilayer perceptron, or multilayer neural network, simply has multiple hidden layers between the input and output layers; later neural networks such as CNNs and DBNs merely redesign the type of each layer. The perceptron can be said to be the foundation of neural networks: the more complex networks that followed are all built on this simplest model.

Machine learning is often mentioned together with the term pattern recognition, but pattern recognition in real environments runs into a variety of problems. For example:

Image segmentation: real scenes are always mixed with other objects. It is difficult to determine which parts belong to the same object. Some parts of an object can be hidden behind other objects.

Object illumination: the intensity of pixels is strongly affected by light.

Image distortion: objects can be deformed in various non-affine ways. For example, a handwritten character may have a large loop or just a cusp.

Situational support: the category to which objects belong is usually defined by how they are used. For example, chairs are designed for people to sit on, so they come in a variety of physical shapes.

The difference between a convolutional neural network and a regular neural network is that a convolutional neural network contains a feature extractor consisting of convolutional layers and subsampling layers. In a convolutional layer, a neuron is connected to only some of the neurons in the neighboring layer. A convolutional layer usually contains several feature maps; each feature map consists of neurons arranged in a rectangle, and the neurons in the same feature map share weights. These shared weights are the convolution kernel. The kernel is generally initialized as a matrix of small random decimals and learns reasonable weights as the network is trained. The immediate benefit of shared weights is fewer connections between the layers of the network and a lower risk of overfitting. Subsampling is also called pooling and usually takes the form of mean pooling or max pooling. Subsampling can be seen as a special kind of convolution. Convolution and subsampling greatly reduce model complexity and the number of model parameters.

The convolutional neural network consists of three parts. The first part is the input layer. The second part consists of a combination of n convolutional and pooling layers. The third part consists of a fully connected multilayer perceptron classifier.

Here’s an example of AlexNet:

- Input: 224×224 image, 3 channels.

- First convolutional layer: 96 convolution kernels of size 11×11, 48 on each GPU.

- First max-pooling layer: 2×2 kernels.

- Second convolutional layer: 256 convolution kernels of size 5×5, 128 on each GPU.

- Second max-pooling layer: 2×2 kernels.

- Third convolutional layer: fully connected to the previous layer, 384 convolution kernels of size 3×3, 192 on each of the two GPUs.

- Fourth convolutional layer: 384 convolution kernels of size 3×3, 192 on each of the two GPUs. This layer is connected to the previous layer without an intervening pooling layer.

- Fifth convolutional layer: 256 convolution kernels of size 3×3, 128 on each of the two GPUs.

- Fifth-layer max-pooling: 2×2 kernels.

- First fully connected layer: 4096 dimensions; the output of the fifth-layer max-pooling is flattened into a one-dimensional vector and used as the input to this layer.

- Second fully connected layer: 4096 dimensions.

- Softmax layer: 1000 outputs, where each output dimension is the probability that the picture belongs to that category.

Convolutional neural networks have important applications in pattern recognition. This is only the simplest explanation of them; there is much more to cover, such as local receptive fields, weight sharing, and multiple convolution kernels, which will be explained when the opportunity arises.

Traditional neural networks struggle with many problems. For example, to predict the next word in a sentence you usually need the previous words, because the words in a sentence are not independent of each other. An RNN is called a recurrent neural network because the current output of a sequence also depends on the previous outputs. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length.

This is the structure of a simple RNN, and you can see that the hidden layer itself is able to connect to itself.

Why can the hidden layer of the RNN see the hidden-layer output of the previous moment? This becomes clear once we unfold the network over time.

From the equations above, we can see that the difference between the recurrent layer and a fully connected layer is that the recurrent layer has an additional weight matrix W.

If we repeatedly bring equation 2 into equation 1, we will get:
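For reference, a commonly used formulation of a simple RNN is sketched below. Only the recurrent weight matrix W is named in the text; the other symbols (x_t for the input, s_t for the hidden state, o_t for the output, U and V for the input and output weight matrices, f and g for the activation functions) are assumptions of this sketch.

```latex
\begin{align}
  o_t &= g(V s_t) \tag{1} \\
  s_t &= f(U x_t + W s_{t-1}) \tag{2}
\end{align}

% Repeatedly substituting (2) into (1):
\begin{align*}
  o_t &= g\bigl(V f(U x_t + W s_{t-1})\bigr) \\
      &= g\bigl(V f(U x_t + W f(U x_{t-1} + W s_{t-2}))\bigr) \\
      &= g\bigl(V f(U x_t + W f(U x_{t-1} + W f(U x_{t-2} + \cdots)))\bigr)
\end{align*}
```

In this form the output at time t depends on x_t, x_{t-1}, x_{t-2}, and so on, which is why the network can use earlier inputs in a sequence.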

Before we talk about DBNs, we need to have some idea of the basic building block of DBNs, which is the RBM, the Restricted Boltzmann Machine.

First of all what is a Boltzmann machine?

A Boltzmann machine is shown in the figure with blue nodes for the hidden layer and white nodes for the input layer.

Compared with recurrent neural networks, the Boltzmann machine differs in the following respects:

1. Recurrent neural networks essentially learn a function, so they have the concept of input and output layers, whereas a Boltzmann machine is used to learn the “intrinsic representation” of a set of data, so it has no concept of an output layer.

2. The nodes of a recurrent neural network are linked in directed cycles, while the nodes of a Boltzmann machine are linked as an undirected complete graph.

And what is a restricted Boltzmann machine?

In the simplest terms, it adds a restriction, and this restriction turns the complete graph into a bipartite graph. That is, it consists of a visible layer and a hidden layer, with bidirectional full connections between the neurons of the visible layer and those of the hidden layer.

h denotes the hidden layer and v denotes the visible layer.

In an RBM, any two connected neurons have a weight w between them that indicates the strength of their connection, and each neuron also has its own bias coefficient, b for visible neurons and c for hidden neurons.

The exact derivation of the formulas is not shown here

A DBN is a probabilistic generative model, in contrast to the discriminative models of traditional neural networks. A generative model builds a joint distribution between observations and labels, evaluating both P(Observation|Label) and P(Label|Observation), while a discriminative model evaluates only the latter, P(Label|Observation).

A DBN consists of multiple layers of restricted Boltzmann machines; a typical network of this type is shown in the figure. These networks are “restricted” to one visible layer and one hidden layer, with connections between the layers but not between units within a layer. The hidden units are trained to capture the higher-order correlations of the data expressed in the visible layer.

Generative adversarial networks were explained in a previous post, so here is a brief recap.

The goal of a generative adversarial network is to generate. Traditional network structures tend to be discriminative models, i.e., they judge the authenticity of a sample. Generative models, on the other hand, can generate similar new samples based on the samples provided; note that these new samples are learned by the computer.

GANs generally consist of two networks, the generative model network, and the discriminative model network.

The generative model G captures the distribution of the sample data, and generates a sample similar to the real training data with noise z obeying a certain distribution (uniform, Gaussian, etc.), pursuing the effect that the more it resembles the real samples, the better; the discriminative model D is a binary classifier estimating the probability that a sample comes from the training data (rather than from the generated data), and if the sample comes from the real training data, D outputs a large probability, otherwise, D outputs a small probability.
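The interplay between G and D described above is usually written as a two-player minimax objective. The form below is the standard one from the GAN literature rather than something stated in the text; p_data denotes the distribution of the real training data and p_z the noise distribution.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

D is trained to push D(x) toward 1 on real samples and D(G(z)) toward 0 on generated ones, while G is trained to push D(G(z)) toward 1.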

As an example: the generative network G is like a counterfeit currency manufacturing gang, specializing in manufacturing counterfeit currency, and the discriminative network D is like a police officer, specializing in detecting whether the currency used is real or counterfeit, G’s goal is to find ways to generate currency that is the same as the real currency, so that D can’t discriminate it, and D’s goal is to find ways to detect the counterfeit currency generated by G.

Traditional discriminative network:

Generative adversarial network:

The following shows an example of a cDCGAN (written in an earlier post)

Generative network

Discriminative network

The final result uses MNIST as the training samples; the digits generated after learning show that the model has learned quite well.

This article is a very brief introduction to four neural network architectures: CNN, RNN, DBN, and GAN. It does not go into great depth. These four architectures are very common and widely used. Of course, neural networks cannot be fully explained in a few posts; the material here covers some of the basics to help you get started quickly. Later posts will cover deep autoencoders, Hopfield networks, and long short-term memory (LSTM) networks.

Building ResNet Convolutional Neural Networks

In 2015, Kaiming He’s team at Microsoft Research Asia released a special kind of convolutional neural network, the residual neural network (ResNet). Before the residual neural network emerged, the deepest neural networks had only about 20 or 30 layers, but this network can easily reach hundreds or even thousands of layers in experiments without taking up too much training time, and image recognition accuracy improved significantly as a result. The model won first place in image classification, localization, and detection in the ImageNet competition of the same year. Such excellent results in international competitions show that the residual neural network is a practical and excellent model. In the cat-versus-dog binary classification experiment of this study, the classification model is also built on a residual neural network.

In this paper we apply the ResNet-18 and ResNet-50 network models to the Kaggle cat and dog dataset, and use ResNet to explore the accuracy currently achievable with convolutional neural networks. Figure 4-1 shows the classic network structure diagram of ResNet-18.

ResNet-18 is composed entirely of BasicBlock units, and Figure 4-2 shows that ResNet models with 50 or more layers are composed of BottleBlock units. We then need to feed our preprocessed dataset into the existing ResNet-18 and ResNet-50 models to train them. First, using the image preprocessing mentioned earlier, we crop each training image to a 96×96 square and input it into the model. Only the structure of the ResNet-18 model is introduced here, as ResNet-50 has a structure similar to the ResNet-34 model in Chapter Five.

The model structure of ResNet-18 is as follows. The first layer is a 7×7 convolution with 64 kernels and a stride of 2, which gives a feature matrix of [112,112,64]. The second layer starts with a 3×3 pooling layer, which reduces the feature matrix to [56,56,64], followed by 2 residual structures. The third layer also has 2 residual structures. Its input feature matrix is [56,56,64] and its output feature matrix must have the shape [28,28,128]. Because the main branch and the shortcut must produce the same output shape, the main branch uses a stride of 2 to halve the height and width from 56 to 28, and 128 convolution kernels to change the depth to 128. The shortcut adds a 1×1 convolution with a stride of 2 and 128 kernels, which likewise halves the width and height of the input feature matrix and changes its depth to 128. The fourth layer has 2 residual structures and works the same way: the input feature matrix is [28,28,128] and the output feature matrix is [14,14,256], so the main branch uses a stride of 2 to halve the height and width from 28 to 14 and 256 kernels to change the depth, while the shortcut uses a 1×1 convolution with a stride of 2 and 256 kernels. The fifth layer has 2 residual structures and, after the same process, outputs a feature matrix of [7,7,512]. This is followed by average pooling, which gives [1,1,512], and the fully connected layer.
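The shape bookkeeping above can be checked with a small sketch. It assumes a 224×224 input (the size that yields the [112,112,64] shape quoted above) and paddings of 3, 1, 1 and 0 for the 7×7 convolution, the 3×3 pooling, the 3×3 convolutions and the 1×1 shortcut convolution. These paddings and the helper names `out_size` and `resnet18_shapes` are assumptions made for illustration; they are not given in the text.

```python
def out_size(n, f, stride, pad):
    """Output height/width of a convolution or pooling layer (rounded down)."""
    return (n + 2 * pad - f) // stride + 1


def resnet18_shapes(n=224):
    """Trace [height, width, depth] through ResNet-18 (assumed 224x224 input)."""
    shapes = []
    n = out_size(n, f=7, stride=2, pad=3)      # 7x7 convolution, 64 kernels
    shapes.append([n, n, 64])
    n = out_size(n, f=3, stride=2, pad=1)      # 3x3 pooling
    shapes.append([n, n, 64])
    shapes.append([n, n, 64])                  # 2 residual structures, stride 1
    for depth in (128, 256, 512):              # layers that halve height and width
        main = out_size(n, f=3, stride=2, pad=1)      # main branch: 3x3, stride 2
        shortcut = out_size(n, f=1, stride=2, pad=0)  # shortcut: 1x1, stride 2
        assert main == shortcut                # both branches must agree to be added
        n = main
        shapes.append([n, n, depth])
    shapes.append([1, 1, 512])                 # average pooling over the whole map
    return shapes


for shape in resnet18_shapes():
    print(shape)
```

Calling `resnet18_shapes(96)` traces the same layers for the 96×96 crop mentioned earlier.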

Convolutional Neural Network (CNN)

The calculation in the figure above proceeds as follows. The matrix on the right, which is called a filter or a kernel, is placed over the first region of the matrix on the left. The values at corresponding positions are multiplied and the products are added up: 3*1 + 1*1 + 2*1 + 0*0 + 0*0 + 0*0 + 1*(-1) + 8*(-1) + 2*(-1) = -5.


Following the same calculation, the filter is moved to the right one step at a time (the step can be set to 1, 2, etc.) and then down one step at a time, and the corresponding value is computed at each position until the whole output is filled in.
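The multiply-and-add procedure can be sketched in a few lines of plain Python. The function name `conv2d` and its `stride` argument are illustrative choices; the 3×3 patch uses the values exactly as they appear in the worked sum above.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`; multiply element-wise and sum at each position."""
    n_rows, n_cols = len(image), len(image[0])
    f_rows, f_cols = len(kernel), len(kernel[0])
    output = []
    for top in range(0, n_rows - f_rows + 1, stride):        # move down
        row = []
        for left in range(0, n_cols - f_cols + 1, stride):   # move right
            total = 0
            for i in range(f_rows):
                for j in range(f_cols):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# The first 3x3 region, with the values used in the worked sum.
patch = [[3, 0, 1],
         [1, 0, 8],
         [2, 0, 2]]
vertical_filter = [[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]]
print(conv2d(patch, vertical_filter))  # [[-5]]
```

On a larger image the same function returns one value per filter position, row by row.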

As shown above, the first image matrix corresponds to a picture that is white on one side and black on the other, so there is a vertical edge in the middle. We can choose a vertical edge detection filter, such as the matrix to the right of the multiplication sign. Convolving the two gives the matrix to the right of the equals sign, whose grayscale picture has a white band in the middle: this is the detected edge. Why does the band in the middle look wide instead of being a very thin line? The reason is that our input image is only 6×6, which is too small. With a larger image, the detected edge band is relatively thin, and the vertical edge feature is still extracted.
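A small sketch of this effect follows. The pixel values (10 for the bright half, 0 for the dark half) and the 6×20 comparison image are assumptions chosen for illustration; the figure itself is not reproduced here.

```python
def conv2d(image, kernel):
    """Multiply-and-sum the kernel over every position (stride 1, no padding)."""
    f = len(kernel)
    rows = len(image) - f + 1
    cols = len(image[0]) - f + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(f) for j in range(f))
             for c in range(cols)]
            for r in range(rows)]


vertical_filter = [[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]]

small = [[10] * 3 + [0] * 3 for _ in range(6)]      # 6x6: bright left, dark right
large = [[10] * 10 + [0] * 10 for _ in range(6)]    # 6x20: same edge, wider image

print(conv2d(small, vertical_filter)[0])            # [0, 30, 30, 0]
band = [v for v in conv2d(large, vertical_filter)[0] if v != 0]
print(len(band), "of", len(large[0]) - 2, "columns respond")  # 2 of 18 columns respond
```

The band is two columns wide in both cases; it only looks wide in the 6×6 example because the output has just four columns.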

The filter parameters above were all chosen by hand. With the development of neural networks, we can use the backpropagation algorithm to learn the parameters of the filter.

We can turn the values of the convolution kernel into parameters and learn them through backpropagation, so that the learned filter, or convolution kernel, can recognize many features, rather than relying on manually selected filters.
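As a minimal sketch of this idea, the example below treats the values of a one-dimensional, two-element kernel as parameters and fits them by gradient descent on a squared-error loss. The gradient is written out by hand for this tiny case, which is what backpropagation would compute automatically. The signal, the target kernel [1, -1], the learning rate and the function names are all assumptions made for illustration.

```python
def correlate1d(signal, kernel):
    """Slide a 1-D kernel over a 1-D signal (stride 1, no padding)."""
    f = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(f))
            for i in range(len(signal) - f + 1)]


def learn_kernel(signal, target, size=2, lr=0.1, steps=200):
    """Fit the kernel values by gradient descent on 0.5 * sum of squared errors."""
    kernel = [0.0] * size
    for _ in range(steps):
        errors = [p - t for p, t in zip(correlate1d(signal, kernel), target)]
        for j in range(size):
            # d(loss)/d(kernel[j]) = sum over positions of error * input value
            grad = sum(e * signal[i + j] for i, e in enumerate(errors))
            kernel[j] -= lr * grad
    return kernel


signal = [0, 0, 1, 1, 0, 1, 0, 0]
target = correlate1d(signal, [1, -1])      # output of a hand-picked edge filter
learned = learn_kernel(signal, target)
print([round(w, 3) for w in learned])      # close to [1, -1]
```

The kernel starts at zero and ends up reproducing the hand-picked edge filter from the data alone.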

- Padding operation. Plain convolution often has two problems:

1. The image shrinks with each convolution, so after many convolutional layers the image becomes very small;

2. Edge pixels are used only once, far less often than pixels in the middle, so information at the edges of the image is lost.

To solve these problems, we can add extra pixels around the border of the image, which is called the padding operation.

If the number of pixels padded on each edge of the image is p, then the size of the convolved image is (n+2p-f+1)×(n+2p-f+1).

How to choose p

There are usually two choices:

- Valid: no padding is applied, so convolving an n×n image with an f×f filter gives an output image of size (n-f+1)×(n-f+1);

- Same: the image is padded so that the output image has the same size as the input. Convolving the padded (n+2p)×(n+2p) image with an f×f filter must give n×n, so n+2p-f+1=n, which gives p=(f-1)/2.

When choosing filters, the default convention is to use a filter of odd size. With an odd f, the Same padding p=(f-1)/2 is a whole number.
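The two padding choices can be compared with a short sketch. The function names are illustrative, and the 6×6 input is just an example size.

```python
def padded_output_size(n, f, p):
    """Side length of the output for an n x n image, f x f filter, padding p, stride 1."""
    return n + 2 * p - f + 1


def same_padding(f):
    """Padding that keeps the output the same size as the input: p = (f - 1) / 2."""
    if f % 2 == 0:
        raise ValueError("an even filter size has no whole-number Same padding")
    return (f - 1) // 2


n = 6
for f in (3, 5):
    p = same_padding(f)
    print("f =", f,
          "| Valid output:", padded_output_size(n, f, 0),
          "| Same padding p =", p,
          "| Same output:", padded_output_size(n, f, p))
```

For f=3 this gives a 4×4 output with Valid padding and a 6×6 output with p=1.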


Strided convolution: the stride is the length of the step by which the filter moves each time during the convolution operation. The convolution described above has a default stride of 1, which means that the filter moves one cell to the right, or one cell down, at a time.

We can also set the stride of the convolution, that is, the number of cells the filter moves at each step. If the image is n×n, the filter is f×f, the padding is p, and the stride is s, then the output image after the convolution is ((n+2p-f)/s+1)×((n+2p-f)/s+1). This raises a question: what should we do if the result is not an integer?

By convention we round down; that is, the filter is only applied at positions where it lies completely on the (padded) image.
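A brief sketch of the output-size formula with rounding down is given here. The helper `window_starts` is a hypothetical addition that lists the positions at which the filter fits entirely on the padded image, to show why rounding down is the natural choice.

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1 for an n x n image and an f x f filter."""
    return (n + 2 * p - f) // s + 1


def window_starts(n, f, p=0, s=1):
    """Offsets (on the padded image) where the filter lies completely on the image."""
    return list(range(0, n + 2 * p - f + 1, s))


print(conv_output_size(7, 3, p=0, s=2), window_starts(7, 3, p=0, s=2))  # 3 [0, 2, 4]
print(conv_output_size(6, 3, p=0, s=2), window_starts(6, 3, p=0, s=2))  # 2 [0, 2]
```

In the second case (6-3)/2+1 = 2.5; a third window starting at offset 4 would run past the edge of the image, so only two positions are computed.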

In fact, the operation described above is not convolution in the strict mathematical sense. By definition, convolution requires the kernel (our filter) to be mirrored before it is moved over the input, that is, before the corresponding elements are multiplied; only the multiplication after this mirroring is strictly a convolution. Mathematically, the operation described above is a cross-correlation. In the field of deep learning, however, the flipping step is omitted by convention, and the operation is still called convolution.
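The difference can be illustrated with a small sketch. The function names and the 3×3 input are made up for this example; the kernel is deliberately asymmetric so that the flip changes the result.

```python
def flip(kernel):
    """Mirror the kernel horizontally and vertically (a 180-degree rotation)."""
    return [row[::-1] for row in kernel[::-1]]


def cross_correlate(image, kernel):
    """The operation used in deep learning: slide, multiply, sum (no flip)."""
    f_rows, f_cols = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(f_rows) for j in range(f_cols))
             for c in range(len(image[0]) - f_cols + 1)]
            for r in range(len(image) - f_rows + 1)]


def convolve(image, kernel):
    """Strict mathematical convolution: flip the kernel first, then slide it."""
    return cross_correlate(image, flip(kernel))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(cross_correlate(image, kernel))  # [[-4, -4], [-4, -4]]
print(convolve(image, kernel))         # [[4, 4], [4, 4]]
```

Since the filter values are learned anyway, a network can learn the flipped kernel just as easily, which is why omitting the flip does no harm in practice.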

A color image has three RGB channels, so the input is three-dimensional. How do we perform the convolution operation on a three-dimensional input image?

For example, suppose the input image in the figure above is 6×6×3, where 3 stands for the three RGB channels, also called the depth. We choose a 3×3×3 filter. Note that the number of channels of the filter must be the same as that of the input image, while there is no such restriction on its height and width. In the calculation, the filter is overlaid on the input as a volume, the 27 pairs of corresponding numbers are multiplied and added up, and the result is one number of the output. Convolving in this way gives an output of 4×4×1. If we have more than one filter, for example one filter to extract vertical features and one to extract horizontal features, then the output is 4×4×2. That is, the depth (the number of channels) of the output equals the number of filters.
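The volume convolution can be sketched as follows. The data layout `[height][width][channel]`, the function names and the all-ones test image are assumptions made for illustration.

```python
def conv_volume(volume, filters):
    """volume: [height][width][channels]; filters: list of [f][f][channels] kernels.
    Returns an output of shape [out_height][out_width][number of filters]."""
    height, width, channels = len(volume), len(volume[0]), len(volume[0][0])
    f = len(filters[0])
    for kernel in filters:
        if len(kernel[0][0]) != channels:
            raise ValueError("filter channels must match input channels")
    output = []
    for r in range(height - f + 1):
        out_row = []
        for c in range(width - f + 1):
            # One number per filter: multiply f*f*channels pairs and sum them.
            out_row.append([
                sum(volume[r + i][c + j][k] * kernel[i][j][k]
                    for i in range(f) for j in range(f) for k in range(channels))
                for kernel in filters
            ])
        output.append(out_row)
    return output


def shape(tensor):
    return [len(tensor), len(tensor[0]), len(tensor[0][0])]


image = [[[1, 1, 1] for _ in range(6)] for _ in range(6)]          # 6x6x3, all ones
ones = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]           # sums 27 numbers
vertical = [[[w] * 3 for w in (1, 0, -1)] for _ in range(3)]       # vertical-edge filter

print(shape(conv_volume(image, [ones])))             # [4, 4, 1]
print(shape(conv_volume(image, [ones, vertical])))   # [4, 4, 2]
```

Each additional filter adds one channel to the output, while the height and width of the output stay at 4×4.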

The notation for the convolution of the l-th layer is as follows:

Suppose our filters are 3×3×3 and we use 10 of them. Each filter has 27 weights plus one bias, i.e. 28 parameters, so the ten filters have 280 parameters to learn in total. This also shows that the number of parameters stays the same regardless of the size of the input image, which makes parameter sharing easy to understand.
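The count can be written as a one-line formula. The function name is illustrative; note that the image size does not appear among its arguments.

```python
def conv_layer_params(f, channels, num_filters, bias=True):
    """Learnable parameters of a convolutional layer with f x f x channels filters."""
    per_filter = f * f * channels + (1 if bias else 0)
    return per_filter * num_filters


print(conv_layer_params(f=3, channels=3, num_filters=10))  # 280
```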

In order to reduce the size of the model, speed up computation, and improve the robustness of the extracted features, we often use pooling layers. Pooling layers are computed in a similar way to convolution, except that the pooling operation is performed on each channel separately.

There are generally two types of pooling: max pooling and average pooling.

The figure above shows max pooling. The calculation is similar to convolution: first set the hyperparameters, such as the size of the filter and the stride, then place the filter over the corresponding cells and output the maximum value in that window. In the figure above, the filter is 2×2 and the stride is 2, so the output is 2×2, and each output cell is the maximum of the input values in the corresponding window. For average pooling, the average of the values in the window is used as the output instead.

From the process above we can see that pooling shrinks the feature maps and at the same time makes the feature values more prominent, which improves the robustness of the extracted features.
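Both pooling types can be sketched with one function applied to a single channel. The 4×4 input values are made up for illustration and are not taken from the figure.

```python
def pool2d(channel, size=2, stride=2, mode="max"):
    """Max or average pooling over one channel (a 2-D list of numbers)."""
    reduce = max if mode == "max" else (lambda values: sum(values) / len(values))
    rows = (len(channel) - size) // stride + 1
    cols = (len(channel[0]) - size) // stride + 1
    return [[reduce([channel[r * stride + i][c * stride + j]
                     for i in range(size) for j in range(size)])
             for c in range(cols)]
            for r in range(rows)]


channel = [[1, 3, 2, 1],
           [2, 9, 1, 1],
           [1, 3, 2, 3],
           [5, 6, 1, 2]]
print(pool2d(channel, mode="max"))       # [[9, 2], [6, 3]]
print(pool2d(channel, mode="average"))   # [[3.75, 1.25], [3.75, 2.0]]
```

Pooling has no learnable parameters; only the window size and the stride are set in advance.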

Still don’t understand the “receptive field” of a convolutional neural network after reading this?

The concept of the “receptive field” comes from biological neuroscience. When our receptors, such as the skin of our hands, are stimulated, they transmit the stimulus to the central nervous system. No single neuron can receive the stimulation of the whole skin, because the skin area is large, and we can feel different stimuli, such as pain or itching, at different places on the body, such as the hands and feet, at the same time. This shows that the skin receptors are innervated by many different neurons. The region of receptors that each neuron responds to is called its “receptive field”: the receptive field is the region that the neuron innervates, or in other words the region whose stimulation affects the neuron’s activity.

In a convolutional neural network, the whole convolution process is similar to the skin stimulation process above. We can think of the original image as the receptor (the skin) and of the final output as the neuron that responds. The state of the final output (the state of the neuron) is affected by a certain region of the original image (the patch of skin that is stimulated), which is exactly the process described above. So we give the following definition of the receptive field:

The receptive field is the size of the region of the original image onto which a pixel of the feature map output by each layer of the convolutional neural network is mapped. In layman’s terms, it is the part of the original image that affects each feature (each pixel) of the output.

To better illustrate the whole process, consider the following example. The original image is 10×10 and the network has five layers: the first four are convolutional layers with a kernel size of 3×3, and the last is a pooling layer of size 2×2. To simplify the description, all strides are 1.

Note: the receptive field calculation does not take boundary padding into account, because the padding is not part of the original image, and the receptive field describes the mapping between the output features and the original image. Actual models may need padding, in which case the principle is the same and only the calculation is slightly more complex.

From the above we can see that the output of the first layer is 8×8, and each feature (i.e., each pixel) of output1 is affected by a 3×3 region of the original image, so the receptive field of the first layer is 3, written as

RF1=3 (each pixel value is related to a 3×3 region of the original image)

From the above figure, it can be seen that after two convolution operations the output image is 6×6. Each feature (i.e., each pixel) of output2 is affected by a 3×3 range of output1, and this 3×3 range of output1 is in turn affected by a 5×5 range of the original image, so the receptive field of the second layer is 5, i.e.

RF2=5 (each pixel value is related to a 5×5 region of the original image)

From the above figure, it can be seen that after three convolution operations the output image is 4×4. Each feature (i.e., each pixel) of output3 is affected by a 3×3 range of output2, this 3×3 range of output2 is in turn affected by a 5×5 range of output1, and this 5×5 range of output1 is in turn affected by a 7×7 range of the original image, so the receptive field of the third layer is 7, i.e.

RF3=7 (each pixel value is related to a 7×7 region of the original image)

From the above figure, it can be seen that after four convolution operations the output image is 2×2. Each feature (i.e., each pixel) of output4 is affected by a 3×3 range of output3, this 3×3 range of output3 is in turn affected by a 5×5 range of output2, this 5×5 range of output2 is in turn affected by a 7×7 range of output1, and this 7×7 range of output1 is in turn affected by a 9×9 range of the original image, so the receptive field of the fourth layer is 9, i.e.

RF4=9 (each pixel value is related to a 9×9 region of the original image)


From the above figure, it can be seen that after four convolution operations and one pooling operation the output image is 1×1. Each feature (i.e., each pixel) of output5 is affected by a 2×2 range of output4, this 2×2 range of output4 is in turn affected by a 4×4 range of output3, this 4×4 range of output3 by a 6×6 range of output2, this 6×6 range of output2 by an 8×8 range of output1, and this 8×8 range of output1 by a 10×10 range of the original image, so the receptive field of the fifth layer is 10, i.e.

RF5=10 (each pixel value is related to a 10×10 region of the original image)

From the process above, it can be seen that the derivation of the receptive field is a recursive process, which is shown below.

RF1 = 3 = k1 (the receptive field of the first layer always equals the size of the first convolution kernel; kn denotes the kernel size of the nth layer)

RF2 = 5 = k1+(k2-1) = RF1+(k2-1)

RF3 = 7 = k1+(k2-1)+(k3-1) = RF2+(k3-1)

RF4 = 9 = k1+(k2-1)+(k3-1)+(k4-1) = RF3+(k4-1)

RF5 = 10 = k1+(k2-1)+(k3-1)+(k4-1)+(k5-1) = RF4+(k5-1)


All the strides above are 1. If the stride of each convolution operation is not 1, the same reasoning applies, and the recursive formula is given directly here:

RFn = RF(n-1) + (kn-1) × stride_1 × stride_2 × … × stride_(n-1)

where stride_n denotes the stride of the nth convolution.

The solution process starts with RF1.
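The recursion can be sketched as a short function. It uses the standard recurrence in which each layer adds (kn-1) multiplied by the product of the strides of all earlier layers; with every stride equal to 1 this reduces to RFn = RF(n-1) + (kn-1) as in the example above. The function name and the second set of layer settings are assumptions for illustration.

```python
def receptive_fields(kernel_sizes, strides=None):
    """Receptive field of each layer's output with respect to the original image."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    fields = []
    rf = 1      # a pixel of the original image only sees itself
    jump = 1    # product of the strides of all earlier layers
    for k, s in zip(kernel_sizes, strides):
        rf = rf + (k - 1) * jump    # RFn = RF(n-1) + (kn - 1) * earlier strides
        fields.append(rf)
        jump *= s
    return fields


# Four 3x3 convolutions and one 2x2 pooling layer, all with stride 1.
print(receptive_fields([3, 3, 3, 3, 2]))                 # [3, 5, 7, 9, 10]
# Hypothetical network: three 3x3 convolutions with strides 2, 2, 1.
print(receptive_fields([3, 3, 3], strides=[2, 2, 1]))    # [3, 7, 15]
```

The first call reproduces RF1 to RF5 from the worked example; the second shows how strides larger than 1 make the receptive field grow faster.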