Structural diagram of a convolutional neural network

Lecture 9: Convolutional Neural Network Architecture

First, a review of LeNet-5, which was very successful at digit recognition. Its structure is [CONV-POOL-CONV-POOL-FC-FC]: the convolutional layers use 5×5 kernels with a stride of 1, the pooling layers use 2×2 regions with a stride of 2, and these are followed by fully connected layers. This is shown in the following figure:

AlexNet, which in 2012 became the first large CNN to win the ImageNet competition, has a structure very similar to LeNet-5, only with more layers: [CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8]. There are five convolutional layers, three pooling layers, two normalization layers, and three fully connected layers. It is shown below:

The network is drawn in two halves, top and bottom, because GPU memory at the time was so small that the model had to be split across two GPUs. Some more details are:

AlexNet improved the accuracy by almost 10 percentage points when it won the 2012 ImageNet competition. The 2013 winner was ZFNet, which used the same network architecture as AlexNet, only with further tuning of the hyperparameters:

This reduced the error rate from 16.4% to 11.7%.

GoogLeNet and VGG, the winner and runner-up in 2014, have 22 and 19 layers, respectively. Each is described below.

Compared to AlexNet, VGG uses smaller convolutional kernels and more layers. VGG comes in 16-layer and 19-layer versions. The convolutional kernels are all 3×3 with a stride of 1 and a pad of 1; the pooling regions are 2×2 with a stride of 2.

So why use a small 3×3 convolution kernel?
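One commonly given answer, sketched here as an assumption rather than taken from the missing figure: a stack of three 3×3 convolutions with stride 1 covers the same 7×7 receptive field as a single 7×7 convolution, but needs fewer weights and passes through more nonlinearities. The channel count `C` below is a hypothetical value used only for the arithmetic.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (biases ignored) when every layer maps `channels` -> `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 64  # hypothetical number of channels, for illustration only

rf_stack = stacked_receptive_field(3, num_layers=3)   # three 3x3 layers
params_stack = conv_weights(3, C, num_layers=3)       # 3 * (3*3*C*C) = 27*C*C
params_single = conv_weights(7, C)                    # 7*7*C*C = 49*C*C

print(rf_stack, params_stack, params_single)
```

For any channel count, the stack costs 27·C² weights against 49·C² for the single large kernel.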

Here’s a look at the parameters and memory usage of VGG-16:

Some of the details of the VGG network are:

Next is the first-place classification network, GoogLeNet.

First, some of the details of GoogLeNet:

The “Inception” module is a carefully designed local network topology, and these modules are stacked on top of each other.

This topology applies a number of different filtering operations, such as 1×1 convolution, 3×3 convolution, 5×5 convolution, and 3×3 pooling, in parallel to the input from the previous layer. The outputs of all the filters are then concatenated together in depth. This is shown below:

But one problem with this structure is that the computational complexity is greatly increased. Consider, for example, the following network setup:

The inputs are 28x28x256 and the outputs of the concatenation are 28x28x672. (Assuming that each filtering operation maintains the input size by zero-padding.) And the computational expense is also very high:

Because the pooling operation maintains the depth of the original inputs, the network’s outputs are bound to increase in depth. The solution is to add a “bottleneck layer” before the convolution operation, which uses a 1×1 convolution to reduce the depth while preserving the size of the original input space, as long as the number of convolution kernels is less than the depth of the original input.

Using this structure, with the same network parameters, does reduce the amount of computation:

The final output is 28x28x480. The total amount of computation at this point is:
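As a rough check of both setups, the multiplication counts can be tallied in a few lines. The filter counts below (128, 192, 96, and 64 for the bottleneck layers) are assumptions, chosen so that the output depths come out to the 672 and 480 quoted in the text; the original figures with the actual configuration are not available here.

```python
def conv_ops(height, width, num_filters, kernel, in_depth):
    """Multiplications for one stride-1 conv layer that keeps the spatial size."""
    return height * width * num_filters * kernel * kernel * in_depth


H = W = 28
D_IN = 256

# Hypothetical filter counts, picked to reproduce the depths 672 and 480.
N_1X1, N_3X3, N_5X5, N_REDUCE = 128, 192, 96, 64

# Naive module: every branch sees the full 256-deep input,
# and the pooling branch passes all 256 channels through.
naive_depth = N_1X1 + N_3X3 + N_5X5 + D_IN
naive_ops = (conv_ops(H, W, N_1X1, 1, D_IN)
             + conv_ops(H, W, N_3X3, 3, D_IN)
             + conv_ops(H, W, N_5X5, 5, D_IN))

# Bottleneck module: 1x1 convs shrink the depth to N_REDUCE before the
# 3x3 and 5x5 convs, and after the pooling branch.
bottleneck_depth = N_1X1 + N_3X3 + N_5X5 + N_REDUCE
bottleneck_ops = (conv_ops(H, W, N_1X1, 1, D_IN)
                  + conv_ops(H, W, N_REDUCE, 1, D_IN)    # reduce before 3x3
                  + conv_ops(H, W, N_3X3, 3, N_REDUCE)
                  + conv_ops(H, W, N_REDUCE, 1, D_IN)    # reduce before 5x5
                  + conv_ops(H, W, N_5X5, 5, N_REDUCE)
                  + conv_ops(H, W, N_REDUCE, 1, D_IN))   # reduce after pooling

print(naive_depth, naive_ops)
print(bottleneck_depth, bottleneck_ops)
```

Under these assumed filter counts the bottleneck version needs roughly a third of the multiplications of the naive one.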

The Inception modules are stacked vertically; for ease of description, the model is shown laid out horizontally:

The total number of parameterized layers is therefore 3+18+1 = 22. The layers in the orange section are not counted in this total; both of those pieces have the following structure: AveragePool5x5+3(V) - Conv1x1+1(S) - FC - FC - SoftmaxActivation - Output.

“The strong performance of relatively shallow networks on this classification task suggests that the features produced by the middle layers of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we expect to encourage discrimination in the lower stages of the classifier, increase the gradient signal that is propagated back, and provide additional regularization. These auxiliary classifiers take the form of smaller convolutional networks placed on top of the outputs of the third and sixth Inception modules. During training, their loss is added to the total loss of the network with a discount weight (the auxiliary classifier losses are weighted by 0.3). At prediction time, these auxiliary networks are discarded.” (quote from the original paper)

Starting in 2015, the number of layers in the network exploded, with the ’15-’17 winners having 152 layers, beginning the “depth revolution.”

ResNet is a very deep network that uses residual connections. Here are the details:

Does ResNet perform so well just because it is deep? The answer is no. Studies have shown that a 56-layer network built by simply stacking convolutional layers has larger training and test errors than a 20-layer one. Since the training error is also higher, overfitting is not the cause; rather, deeper networks are harder to optimize. Yet a deeper model should be able to perform at least as well as a shallower one, because it can be constructed as follows: copy the layers of the shallower model into the deeper one and set the additional layers to identity mappings. This construction motivates a design that lets the deeper model learn better.

Instead of using parameterized layers to learn the underlying mapping between input and output directly, as ordinary CNNs (e.g., AlexNet or VGG) do, ResNet uses several parameterized layers to learn the residual mapping between the input and the output.

Let the input be X, and let the mapping of a parameterized network layer be H, so that the output of the layer for input X is H(X). An ordinary CNN learns the parametric function H directly through training, and thus directly obtains the mapping from X to H(X). Residual learning instead uses several parameterized layers to learn the mapping from the input to the residual between output and input, i.e., X -> (H(X) - X), and then adds the identity mapping of X. The output of the network is therefore still (H(X) - X) + X = H(X); only the part (H(X) - X) is learned, while the X part is passed through unchanged.

By introducing the identity mapping, the residual learning unit establishes a direct channel between input and output, so that the parameterized layers can concentrate on learning the residual between them. We generally write F(X) for the residual mapping, so the output of the residual learning unit is Y = F(X) + X. When the input and output have the same number of channels, X can be added directly. When the numbers of channels differ, we need an effective shortcut mapping so that the processed input X and the output Y have the same number of channels.

When X and Y have different numbers of channels, there are two ways to implement the shortcut. One is to simply pad X with zeros for the channels it lacks relative to Y so that the two are aligned. The other is to apply a 1×1 convolution, denoted Ws, to X so that the shortcut and the output have the same number of channels.
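A minimal NumPy sketch of a residual unit may make this concrete. It is a simplification under stated assumptions: each row of `x` is the channel vector of one pixel, the residual branch is two plain matrix multiplications standing in for the convolutions, and `w_s` plays the role of the 1×1 projection Ws. The function and variable names are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_unit(x, w1, w2, w_s=None):
    """Compute relu(F(x) + shortcut(x)) for x of shape (num_pixels, channels).

    F(x) = relu(x @ w1) @ w2 is the learned residual branch.
    The shortcut is x itself, or x @ w_s when the channel counts differ
    (a 1x1 convolution acts as a matrix over the channel axis).
    """
    residual = relu(x @ w1) @ w2
    shortcut = x if w_s is None else x @ w_s
    return relu(residual + shortcut)


rng = np.random.default_rng(0)
x = rng.random((5, 4))                      # 5 pixels, 4 channels, non-negative

# If the residual branch outputs zero, the unit is exactly the identity.
w_zero = np.zeros((4, 4))
y_identity = residual_unit(x, w_zero, w_zero)

# Going from 4 to 8 channels needs the projection shortcut.
w1 = rng.standard_normal((4, 8))
w2 = rng.standard_normal((8, 8))
w_s = rng.standard_normal((4, 8))
y_projected = residual_unit(x, w1, w2, w_s)

print(y_identity.shape, y_projected.shape)
```

The zero-weight case shows why extra residual layers cannot hurt in principle: doing nothing is the easiest thing for the branch to represent.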

The complete network structure is as follows:

For ResNet-50 and deeper networks, a GoogLeNet-like “bottleneck layer” is used for computational efficiency. As in the Inception module, 1×1 convolutions reduce and then restore the feature map depth, so that the number of kernels in the 3×3 convolution is not tied to the depth of the previous layer's output, and its own output does not constrain the next layer. This is designed purely to save computation and thus reduce the time required to train the entire model; it has no impact on the final model accuracy.

The actual training of ResNet is as follows:

The actual training result is that a great many layers can be stacked without loss of accuracy: 152 on ImageNet and 1202 on CIFAR. Now, as expected, the deeper the network, the higher the training accuracy. ResNet swept all the 2015 awards and exceeded human recognition rates for the first time.

The left graph below compares the Top-1 accuracy of various networks; the right graph shows the computational complexity of the different networks, with the horizontal axis being the amount of computation and the size of each circle indicating the memory footprint. Here Inception-v4 is ResNet + Inception.

The graph shows:

Forward propagation time and power consumption can also be compared:

Image Segmentation: Fully Convolutional Networks (FCN) Explained

As one of the three major tasks of computer vision (image classification, object detection, and image segmentation), image segmentation has developed significantly in recent years. The technique is also widely used in autonomous driving, for example to recognize drivable areas and lane lines.

Fully Convolutional Networks (FCN) is a framework for semantic image segmentation proposed in 2015 by Jonathan Long et al. of UC Berkeley in the article Fully Convolutional Networks for Semantic Segmentation. Although there are already many articles about this framework, I would like to organize my own understanding here.

The overall network structure is divided into two parts: the fully convolutional part and the deconvolutional part. The fully convolutional part borrows a classic CNN (such as AlexNet, VGG, or GoogLeNet) and replaces its final fully connected layers with convolutions; it extracts features and produces heatmaps. The deconvolutional part upsamples the small heatmaps to obtain a semantic segmentation image of the original size.

The input of the network can be a color image of any size; the output is of the same size as the input, with the number of channels: n (number of target categories) + 1 (background).

The purpose of replacing the fully connected part of the CNN with convolutions is to allow the input image to be of any size above a certain minimum.

Since the heatmap shrinks during convolution (its length and width become a fraction of those of the original image), we need to upsample in order to get a dense, per-pixel prediction at the original image size.

An intuitive idea is to perform bilinear interpolation, which is easily implemented as a backward convolution with a fixed kernel. Backward convolution is also called deconvolution, and recent articles often call it transposed convolution.

In practice, the authors do not fix the convolution kernel, but rather make the convolution kernel a learnable parameter.
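To see how a fixed kernel can perform bilinear upsampling, here is a small NumPy sketch. It is an illustration under assumptions, not the authors' implementation: `bilinear_kernel` follows the construction commonly used to initialize such layers, and `transposed_conv2d` is a hypothetical single-channel helper. In a learnable version, this kernel would only be the starting point for training.

```python
import numpy as np


def bilinear_kernel(size):
    """2-D bilinear interpolation weights for a square kernel of side `size`."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - abs(rows - center) / factor) * (1 - abs(cols - center) / factor)


def transposed_conv2d(x, kernel, stride):
    """Single-channel transposed convolution: every input pixel 'stamps'
    a scaled copy of the kernel onto the (larger) output grid."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out


kernel = bilinear_kernel(4)                 # 4x4 kernel for 2x upsampling
heatmap = np.ones((3, 3))                   # a constant toy heatmap
upsampled = transposed_conv2d(heatmap, kernel, stride=2)

print(kernel)
print(upsampled.shape)
```

Away from the border, a constant heatmap stays constant after upsampling, which is what interpolation should do; the border rows and columns are attenuated and are normally cropped.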

If we upsample only the last feature map to the original size using the technique above, many details are lost because that feature map is too small. The authors therefore add skip connections that combine the prediction of the last layer (which carries richer global information) with predictions from shallower layers (which carry more local detail), so that local predictions can be made while remaining consistent with the global prediction.

FCN still has some drawbacks, such as:

The results obtained are not yet fine enough and are not sensitive enough to details;

Pixel-to-pixel relationships are not taken into account, and there is a lack of spatial consistency.

Reference: zomi, Full Convolutional Network FCN in detail, Zhihu column article


Convolutional Neural Networks (CNN) Fundamentals

On Qixi, the Chinese Valentine’s Day (the seventh day of the seventh lunar month, when the Cowherd and the Weaver Girl meet), I finally finished learning CNNs (from CS231n). I have a lot of thoughts, so I am writing them down quickly before I forget. Finally, I wish you all a happy Valentine’s Day 5555555. On to the main topic!

A CNN is made up of the convolutional layer (CONV), the ReLU layer (ReLU), the pooling layer (Pooling), and the fully connected layer (FC, Full Connection). Each layer is explained in detail below.

Convolution, and in particular convolution of an image, requires a filter that traverses the entire image. Suppose we have a 32*32*3 original image A and a 5*5*3 filter, denoted w; the values in the filter are part of the parameters of the CNN. Filtering A with w at one position can then be expressed by the following equation: f(x) = w^T x + b

Here x is a 5*5*3 patch of the original image and b is the bias term, set to 1. Filtering all of A produces 28*28*1 data. Now suppose we have six filters that are independent of each other, i.e., their values are different and unrelated. One can think of one filter as looking for vertical edges across the whole image, one for horizontal edges, one for the color red, one for black, and so on. We then get six 28*28*1 outputs, and stacking them gives 28*28*6 data. This is the main job of the convolutional layer.
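The whole procedure fits in a few lines of NumPy. This is a deliberately naive sketch with explicit loops (real libraries are far faster); the function name `conv_layer` and the random data are placeholders for illustration.

```python
import numpy as np


def conv_layer(image, filters, biases, stride=1):
    """Naive convolution layer.

    image:   (H, W, D) input volume
    filters: (K, F, F, D) bank of K filters
    biases:  (K,) one bias per filter
    returns: (H_out, W_out, K) stack of activation maps
    """
    h, w, _ = image.shape
    k, f = filters.shape[0], filters.shape[1]
    h_out = (h - f) // stride + 1
    w_out = (w - f) // stride + 1
    out = np.zeros((h_out, w_out, k))
    for n in range(k):
        for i in range(h_out):
            for j in range(w_out):
                patch = image[i * stride:i * stride + f, j * stride:j * stride + f, :]
                out[i, j, n] = np.sum(patch * filters[n]) + biases[n]   # w^T x + b
    return out


rng = np.random.default_rng(0)
A = rng.random((32, 32, 3))                 # the 32*32*3 image
w = rng.standard_normal((6, 5, 5, 3))       # six independent 5*5*3 filters
b = np.ones(6)                              # bias term set to 1, as in the text

activation_maps = conv_layer(A, w, b)
print(activation_maps.shape)
```

Each of the six filters contributes one 28*28 slice of the output volume.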

A CNN can be seen as a neural network that processes the original data through a series of convolutional layers and ReLU layers. The process can be represented by the following diagram:

It is particularly important to note that the depth of a filter must equal the depth of the data coming from the previous layer. For example, the second convolutional layer in the figure above processes 28*28*6 data, so it must use 5*5*6 filters.

As the filter moves across the image, the question of stride naturally arises. All the examples above use a stride of 1. With a stride of 3, convolving a 32*32*3 image with a 5*5*3 filter gives an output size of (32-5)/3+1 = 10. Note that the stride cannot be 2 here, because (32-5)/2+1 = 14.5 is not an integer.

So when the image size is N, the filter size is F, and the stride is S, the size after convolution is (N-F)/S+1.

We can see from the figure above that the length and width of the image shrink layer by layer; after five or more layers there would most likely be only a 1*1 spatial extent left, which is undesirable and makes the later computation harder. We want the spatial size to stay unchanged after a convolutional layer, so we introduce the pad-the-border operation. Padding simply fills zeros around the border of the image, enlarging it so that its size remains unchanged after the convolutional layer. A CNN layer therefore has four main hyperparameters: the number of filters K, the filter size F, the pad size P, and the stride S, where P is an integer. With P = 1, the operation on the original data is as shown in the figure:

The size of the convolved image after padding is then (N-F+2*P)/S+1.

To keep the spatial size of the image unchanged after the convolutional layer (with a stride of S = 1), P can be set to P = (F-1)/2.
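These size rules are easy to check in code. The helper names below are made up for this sketch; the formulas are the ones given above.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size (N - F + 2P)/S + 1; raises if the stride does not fit."""
    span = n - f + 2 * pad
    if span % stride != 0:
        raise ValueError(
            f"stride {stride} does not fit: ({n} - {f} + 2*{pad}) / {stride} is not an integer"
        )
    return span // stride + 1


def same_padding(f):
    """Pad size P = (F - 1)/2 that keeps the size unchanged at stride 1 (odd F only)."""
    if f % 2 == 0:
        raise ValueError("this padding rule needs an odd filter size")
    return (f - 1) // 2


size_stride1 = conv_output_size(32, 5)                        # plain 5x5 filter
size_stride3 = conv_output_size(32, 5, stride=3)              # stride 3 fits
size_padded = conv_output_size(32, 5, pad=same_padding(5))    # padded to keep the size

try:
    conv_output_size(32, 5, stride=2)                         # 14.5 is not a valid size
    stride2_ok = True
except ValueError:
    stride2_ok = False

print(size_stride1, size_stride3, size_padded, stride2_ok)
```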

A convolutional layer takes input data of size W1*H1*D1 and outputs data of size W2*H2*D2. The layer has four hyperparameters in total:

K: the number of filters

P: the pad size

S: the stride of the filter each time it moves

F: the filter size

The size of the output can then be calculated from the input size and the hyperparameters: W2 = (W1-F+2*P)/S+1, H2 = (H1-F+2*P)/S+1, and D2 = K.




1*1 filters also make sense: they perform a convolution along the depth direction. For example, a 1*1*64 filter applied to 56*56*64 data gives 56*56*1 data.
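In NumPy terms, a 1*1 convolution is just a dot product over the depth axis at every pixel. The sketch below uses random placeholder data; the second part, with 32 filters, is an added illustration of how several 1*1 filters change the depth.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((56, 56, 64))          # 56*56*64 input volume

# One 1*1*64 filter: a weighted sum over the 64 channels at each pixel.
w_single = rng.standard_normal(64)
y_single = x @ w_single               # shape (56, 56)

# A bank of 32 such filters (hypothetical count) maps depth 64 -> 32.
w_bank = rng.standard_normal((64, 32))
y_bank = x @ w_bank                   # shape (56, 56, 32)

print(y_single.shape, y_bank.shape)
```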

F is usually an odd number so that data from the top, bottom, left, and right directions can be combined.

Viewed from the perspective of neurons, a convolutional layer has two properties: parameter sharing and local connectivity. Take one 5*5*3 filter: convolving 32*32*3 data with it gives 28*28 data, which can be regarded as 28*28 neurons, each computing on one 5*5*3 region of the original image. Because these 28*28 neurons all use the same filter, their parameters are identical; we call this property parameter sharing.

With several different filters, neurons from different filters look at the same region of the image, which is equivalent to having multiple neurons along the depth direction. Each of them sees only a local region of the input rather than the whole image; this is called local connectivity.

Parameter sharing reduces the number of parameters and helps prevent overfitting.

Local connectivity makes it possible to find a richer representation of the different features of the image.
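A quick count shows how much the two properties save for the 32*32*3 input and one 5*5*3 filter used above. The comparison cases (no sharing, and full connection) are hypothetical alternatives added for illustration.

```python
F, D = 5, 3                 # filter size and input depth
IN_H = IN_W = 32            # input spatial size
OUT_H = OUT_W = 28          # output spatial size for one filter

weights_per_neuron = F * F * D + 1                      # 75 weights + 1 bias

# Parameter sharing: all 28*28 neurons reuse the same filter.
shared = weights_per_neuron

# Local connectivity without sharing: every neuron owns its own 5*5*3 filter.
local_unshared = OUT_H * OUT_W * weights_per_neuron

# No locality either: every neuron connects to the whole 32*32*3 image.
fully_connected = OUT_H * OUT_W * (IN_H * IN_W * D + 1)

print(shared, local_unshared, fully_connected)
```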

Convolution is like another representation of the original image.

Activation function: applying the ReLU function to every element is sufficient. It does not change the spatial size of the data.

With padding, the length and width of the output stay under control and do not change, but the depth does. The ever larger data makes computation difficult and also produces redundant features, so a pooling operation is needed. Pooling does not change the depth; it only changes the length and width. The two main methods are max pooling and average pooling, and the pooling filter generally has size F = 2 and stride 2. Max pooling can be clearly represented by the following image:

A pooling layer takes input data of size W1*H1*D1 and outputs data of size W2*H2*D2. The layer has two hyperparameters in total:

S: the stride of the filter each time it moves

F: the filter size

The size of the output can then be obtained from the input size and the hyperparameters: W2 = (W1-F)/S+1, H2 = (H1-F)/S+1, and D2 = D1.
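Max pooling with F = 2 and S = 2 can be sketched directly in NumPy. The function name and the tiny 4*4 input are illustrative only.

```python
import numpy as np


def max_pool(x, f=2, stride=2):
    """Max pooling over an (H, W, D) volume; the depth D is left unchanged."""
    h, w, d = x.shape
    h_out = (h - f) // stride + 1
    w_out = (w - f) // stride + 1
    out = np.empty((h_out, w_out, d))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            out[i, j, :] = window.max(axis=(0, 1))   # largest value per channel
    return out


# A 4*4*1 input holding the values 0..15 row by row.
x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool(x)

print(pooled[:, :, 0])
```

Each 2*2 block of the input is reduced to its largest entry, so the 4*4 map becomes 2*2 while the depth stays 1.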




The data produced by the last layer (CONV, ReLU, or Pool) is fed into the fully connected layer. The W2*H2*D2 data is flattened into a vector of length W2*H2*D2, so the input layer has W2*H2*D2 neurons in total. The size of the output layer is determined by the problem, and the output can be produced by softmax. In other words, the fully connected part is an ordinary BP (backpropagation) neural network. It is also the part with the most parameters, and the part that later designs try to remove. The complete neural network can be represented by the following diagram:
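As a sketch of this last stage, the snippet below flattens a feature volume and pushes it through one fully connected layer with a softmax output. The 7*7*16 volume, the 10 classes, and the random weights are all hypothetical values chosen for illustration.

```python
import numpy as np


def softmax(scores):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    e = np.exp(scores - scores.max())
    return e / e.sum()


rng = np.random.default_rng(0)
features = rng.random((7, 7, 16))          # hypothetical W2*H2*D2 volume
x = features.reshape(-1)                   # flattened to W2*H2*D2 = 784 inputs

num_classes = 10                           # set by the problem
weights = rng.standard_normal((num_classes, x.size)) * 0.01
bias = np.zeros(num_classes)

probs = softmax(weights @ x + bias)
print(x.shape, weights.size, probs.sum())
```

Even this small layer holds 7840 weights, which hints at why the fully connected part dominates the parameter count.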


1. Smaller filters with deeper networks

2. Only CONV layers, with pooling and fully connected layers removed

The earliest CNN, used to recognize zip codes, has the following structure:


Filter size 5*5 with stride 1; pooling layer 2*2 with stride 2

In 2012, because of the limits of GPU hardware, the original AlexNet was computed in two parts on two GPUs. The combined structure is presented here.

The input image is 227*227*3

1. ReLU is used for the first time

2. Norm layers are used; they have since been abandoned because they are not very effective

3. The data is preprocessed (e.g., size changes, color changes, etc.)

4. Dropout rate 0.5

5. Batch size 128

6. SGD with momentum 0.9 (see my other posts for SGD and Momentum)

7. Learning rate 0.01, divided by 10 when the accuracy stops improving; convergence is reached after 1-2 such reductions

8. L2 weight decay 0.0005

9. Error rate 15.4%

An improvement on AlexNet. The main changes:

1. CONV1 filter changed from 11*11 with stride S = 4 to 7*7 with stride 2.

2. The numbers of filters in CONV3, 4, 5 changed from 384, 384, 256 to 512, 1024, 512 (filter counts that are powers of 2 suit computer arithmetic and can improve efficiency).

Error rate: 14.8%, reduced to 11.2% after further improvement

Currently the best and easiest-to-use CNN. All convolutional filters are 3*3 with a stride of 1 and pad = 1, and the pooling layers are 2*2 max pooling with S = 2.

Most of the parameters come from the fully connected layers, which is one reason for wanting to remove FC.

The structure is highly uniform, a simple linear stack of layers, easy to understand, and easy to extend into variants such as VGG-16 and VGG-19.

Error rate of 7.3%

Completely removes the FC layers; there are only 5 million parameters. It uses the Inception module (I do not fully understand it yet and will come back to it when I have time).

Error rate of 6.67%

Error rate of 3.6%

Has a very deep network structure, and the deeper it is, the higher the accuracy. This is a property traditional CNNs do not have: for a traditional CNN, deeper does not mean more accurate. It needs a longer training time but is faster than VGG.

1. Batch Normalization after each convolutional layer

2. Xavier/2 initialization

3. SGD + Momentum (0.9)

4. Learning rate 0.1, divided by 10 when the accuracy stops improving (larger than AlexNet's because of Batch Normalization)


7. No dropout used (because of Batch Normalization)

The specific gradient process after learning ResNet.

An article on four basic neural network architectures


When you are just getting started with neural networks, the many architectures can be confusing. This article introduces four common neural networks: CNN, RNN, DBN, and GAN. These four basic architectures should give us a certain understanding of neural networks.

A neural network is a model in machine learning, an algorithmic mathematical model that mimics the behavioral characteristics of animal neural networks for distributed parallel information processing. This type of network relies on the complexity of the system to process information by adjusting the relationship between the large number of nodes interconnected within it.

In general, the architecture of neural networks can be divided into three categories:

Feed-forward neural networks:

This is the most common type of neural network used in practical applications. The first layer is the input and the last layer is the output. If there are multiple hidden layers, we call them “deep” neural networks. They compute a series of transformations that change the similarity of the samples. The activity of the neurons in each layer is a nonlinear function of the activity of the previous layer.

Recurrent networks:

Recurrent networks have directed cycles in their connection graphs, which means you can follow the arrows and get back to where you started. They can have complex dynamics, which makes them hard to train. They are more biologically realistic.

Recurrent networks are intended for processing sequential data. In a traditional neural network model, data flows from the input layer to the hidden layer to the output layer; adjacent layers are fully connected, while the nodes within each layer are not connected to each other. Such an ordinary network is inadequate for many problems. For example, to predict the next word in a sentence you generally need the preceding words, because the words in a sentence are not independent of each other.

In a recurrent neural network, the current output for a sequence also depends on the previous outputs. The network remembers earlier information and applies it to the computation of the current output: the nodes within the hidden layer are no longer unconnected but connected, and the input to the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.

Symmetrically connected networks:

The previous post already talked a bit about perceptrons, so I’ll just recap here.

First of all, it’s still this picture

This is an M-P neuron

A neuron has n inputs, each with a corresponding weight w. Inside the neuron, the inputs are multiplied by their weights and summed, the bias is subtracted from the sum, and the result is passed into an activation function, which gives the final output. The output is often binary, with state 0 representing inhibition and state 1 representing activation.

The perceptron can be thought of as a hyperplane decision surface in an n-dimensional instance space: it outputs 1 for samples on one side of the hyperplane and 0 for samples on the other side. The equation of this decision hyperplane is w⋅x=0. A set of positive and negative samples that can be separated by some hyperplane is called a linearly separable sample set, and such a set can be represented by the perceptron in the figure.

AND, OR, and NOT are linearly separable problems that can easily be represented by a perceptron with two inputs. XOR, however, is not linearly separable, so a single-layer perceptron does not work; a multilayer perceptron is needed to solve the XOR problem.
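To make this concrete, here is a small sketch with hand-picked weights and thresholds (they are one workable choice among many, not values from the original post). Single perceptrons realize AND, OR, and NOT; XOR is then built by composing them into several layers.

```python
def step(z):
    """Binary activation: 1 (activated) if z > 0, else 0 (inhibited)."""
    return 1 if z > 0 else 0


def perceptron(inputs, weights, threshold):
    """Weighted sum of the inputs minus a threshold, passed through the step function."""
    return step(sum(w * x for w, x in zip(weights, inputs)) - threshold)


# Hand-picked parameters (an assumption for illustration).
def AND(a, b):
    return perceptron((a, b), (1, 1), 1.5)


def OR(a, b):
    return perceptron((a, b), (1, 1), 0.5)


def NOT(a):
    return perceptron((a,), (-1,), -0.5)


def XOR(a, b):
    """Not linearly separable, so it needs a composition of perceptrons."""
    return AND(OR(a, b), NOT(AND(a, b)))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b))
```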

What should we do if we want to train a perceptron?

We start with random weights and repeatedly apply the perceptron to each training sample, modifying the weights whenever it misclassifies a sample. This process is repeated until the perceptron classifies all samples correctly. Each step modifies the weight wi corresponding to the input xi according to the perceptron training rule: wi ← wi + Δwi, where Δwi = η(t − o)xi.

Here t is the target output of the current training sample, o is the output of the perceptron, and η is a positive constant known as the learning rate. The learning rate serves to moderate the extent to which the weights are adjusted at each step; it is usually set to a small value (e.g., 0.1) and is sometimes made to decay as the number of times the weights are adjusted increases.
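The training rule can be tried out on the AND problem. This is a minimal sketch with two deliberate simplifications, both assumptions of this example: the weights start at zero instead of random values so that the run is reproducible, and the bias is handled as an extra weight on a constant input of 1.

```python
def predict(weights, x):
    """Perceptron output; weights[0] is the bias weight for a constant input of 1."""
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else 0


def train_perceptron(samples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until every sample is classified correctly."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)        # zero start (assumption, see lead-in)
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            o = predict(weights, x)
            if o != t:
                errors += 1
                weights[0] += eta * (t - o)             # bias input is always 1
                for i, xi in enumerate(x):
                    weights[i + 1] += eta * (t - o) * xi
        if errors == 0:                      # a full pass with no mistakes: done
            break
    return weights


and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
learned = train_perceptron(and_samples)

print([predict(learned, x) for x, _ in and_samples])
```

Because AND is linearly separable, the loop stops after a handful of passes; on XOR the same loop would run until `max_epochs` without ever reaching zero errors.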

Multilayer perceptrons, or multilayer neural networks, simply have multiple hidden layers between the input and output layers; later networks such as CNNs and DBNs merely redesign the type of each layer. The perceptron can be called the foundation of neural networks: the more complex networks that follow all build on this simplest model.

When it comes to machine learning, the term pattern recognition often comes up, but pattern recognition in real environments runs into a variety of problems. For example:

Image segmentation: real scenes are always mixed with other objects. It is difficult to determine which parts belong to the same object. Some parts of an object can be hidden behind other objects.

Object illumination: the intensity of pixels is strongly affected by light.

Image distortion: objects can be distorted in various non-affine ways. For example, a handwritten character can have a large loop or just a pointed cusp.

Situational support: the category to which objects belong is usually defined by how they are used. For example, chairs are designed for people to sit on, so they come in a variety of physical shapes.

The difference between a convolutional neural network and an ordinary neural network is that a convolutional neural network contains a feature extractor made up of convolutional layers and subsampling layers. In a convolutional layer, a neuron is connected to only some of the neurons in the neighboring layer. A convolutional layer usually contains several feature maps; each feature map consists of neurons arranged in a rectangle, and neurons in the same feature map share weights. These shared weights are the convolution kernel. The kernel is generally initialized as a matrix of small random numbers and learns reasonable weights during training. The immediate benefit of shared weights is to reduce the number of connections between layers while also reducing the risk of overfitting. Subsampling is also called pooling and usually takes the form of mean pooling or max pooling. It can be seen as a special kind of convolution. Convolution and subsampling greatly reduce the model complexity and the number of parameters.

The convolutional neural network consists of three parts. The first part is the input layer. The second part consists of a combination of n convolutional and pooling layers. The third part consists of a fully connected multilayer perceptron classifier.

Here’s an example of AlexNet:

- Input: 224×224 image, 3 channels

- First convolutional layer: 96 kernels of size 11×11, 48 on each GPU.

- First max-pooling layer: 3×3 kernels with a stride of 2.

- Second convolutional layer: 256 kernels of size 5×5, 128 on each GPU.

- Second max-pooling layer: 3×3 kernels with a stride of 2.

- Third convolutional layer: fully connected to the previous layer, 384 kernels of size 3×3, 192 on each of the two GPUs.

- Fourth convolutional layer: 384 kernels of size 3×3, 192 on each of the two GPUs. This layer is connected to the previous layer without going through a pooling layer.

- Fifth convolutional layer: 256 kernels of size 3×3, 128 on each of the two GPUs.

- Fifth-layer max-pooling: 3×3 kernels with a stride of 2.

- First fully connected layer: 4096 dimensions; the output of the fifth-layer max-pooling is flattened into a one-dimensional vector as the input to this layer.

- Second fully connected layer: 4096 dimensions

- Softmax layer: the output has 1000 dimensions, each giving the probability that the image belongs to the corresponding category.

Convolutional neural networks have important applications in pattern recognition. This is of course only the simplest explanation; there is much more to them, such as local receptive fields, weight sharing, and multiple convolution kernels, which I will explain when there is an opportunity.

Traditional neural networks struggle with many problems. For example, to predict the next word in a sentence you usually need the preceding words, because the words in a sentence are not independent of each other. An RNN is called a recurrent neural network because the current output for a sequence also depends on the previous outputs. Concretely, the network memorizes earlier information and applies it to the computation of the current output: the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. In theory, an RNN can process sequences of any length.

This is the structure of a simple RNN, and you can see that the hidden layer itself is able to connect to itself.

Why can the hidden layer of an RNN see its own output from the previous time step? This becomes very clear once we unroll the network.

From the equations above, we can see that the difference between the recurrent layer and a fully connected layer is that the recurrent layer has an additional weight matrix W.

If we repeatedly bring equation 2 into equation 1, we will get:
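The equations referred to here are not reproduced in this text. A common formulation, assumed for this sketch, is o_t = g(V s_t) for equation 1 and s_t = f(U x_t + W s_(t-1)) for equation 2, where U and V are the input-to-hidden and hidden-to-output matrices. The choices f = tanh and g = softmax, the matrix sizes, and the weight scale are also assumptions.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def rnn_forward(xs, U, W, V, s0):
    """Run a simple RNN over the sequence xs; returns (states, outputs)."""
    s = s0
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)        # assumed equation 2: state from input and old state
        outputs.append(softmax(V @ s))    # assumed equation 1: output from the current state
        states.append(s)
    return states, outputs


rng = np.random.default_rng(0)
U = 0.3 * rng.standard_normal((4, 3))     # input -> hidden
W = 0.3 * rng.standard_normal((4, 4))     # hidden -> hidden, the extra recurrent matrix
V = 0.3 * rng.standard_normal((2, 4))     # hidden -> output
xs = [rng.standard_normal(3) for _ in range(3)]
s0 = np.zeros(4)

states, outputs = rnn_forward(xs, U, W, V, s0)

# Perturb only the first input. Through W, the perturbation reaches later states.
xs_changed = [xs[0] + 1.0] + xs[1:]
states_changed, _ = rnn_forward(xs_changed, U, W, V, s0)

# With W = 0 the layer has no memory and acts like a fully connected layer.
states_no_memory, _ = rnn_forward(xs_changed, U, np.zeros((4, 4)), V, s0)

print(np.abs(states[-1] - states_changed[-1]).max())
print(np.abs(states_no_memory[-1] - np.tanh(U @ xs[-1])).max())
```

Substituting the state update into itself step by step is what makes the last output depend on every earlier input.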

Before we talk about DBNs, we need to have some idea of the basic building block of DBNs, which is the RBM, the Restricted Boltzmann Machine.

First of all what is a Boltzmann machine?


A Boltzmann machine is shown in the figure with blue nodes for the hidden layer and white nodes for the input layer.

Compared with recurrent neural networks, Boltzmann machines differ in the following ways:

1. Recurrent neural networks essentially learn a function, so they have the concept of input and output layers, while a Boltzmann machine is used to learn an “intrinsic representation” of a set of data, so it has no concept of an output layer.


2. The nodes of a recurrent neural network are linked in a directed ring, while the nodes of a Boltzmann machine are linked in an undirected complete graph.

And what is a restricted Boltzmann machine?

In the simplest terms, it is a Boltzmann machine with one added restriction: the complete graph becomes a bipartite graph. That is, it consists of a visible layer and a hidden layer, with bidirectional full connections between the neurons of the visible layer and those of the hidden layer, and no connections within a layer.
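The effect of the restriction on the connection pattern can be sketched by listing the edges of both graphs. The node names and layer sizes below are hypothetical, chosen only for illustration.

```python
from itertools import combinations

def boltzmann_edges(nodes):
    """Undirected complete graph: every pair of nodes is connected."""
    return list(combinations(nodes, 2))

def rbm_edges(visible, hidden):
    """Bipartite graph: each visible node connects to each hidden node only."""
    return [(v, h) for v in visible for h in hidden]

visible = ["v1", "v2", "v3"]
hidden = ["h1", "h2"]
full = boltzmann_edges(visible + hidden)  # all pairs among 5 nodes
restricted = rbm_edges(visible, hidden)   # only visible-hidden pairs
```

With 3 visible and 2 hidden nodes, the complete graph has 10 edges while the bipartite one keeps only the 6 that cross between the layers.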

h denotes the hidden layer and v denotes the visible layer.

In an RBM, any two connected neurons have a weight w between them that indicates the strength of their connection, and each neuron has a bias coefficient of its own: b for visible neurons and c for hidden neurons.
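As a rough illustration of how w, b and c are used, the sketch below computes the conditional activation probabilities of an RBM. It assumes binary (0/1) units, which is a common choice but is not stated in this post; the function names and toy values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_hidden_given_visible(v, w, c):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * w[i][j]) for each hidden unit j."""
    return [sigmoid(c[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(c))]

def p_visible_given_hidden(h, w, b):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j w[i][j] * h_j) for each visible unit i."""
    return [sigmoid(b[i] + sum(w[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

# Hypothetical toy RBM: 3 visible units, 2 hidden units.
w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]  # w[i][j] links visible i and hidden j
b = [0.0, 0.1, -0.1]                        # visible biases
c = [0.2, -0.2]                             # hidden biases
v = [1, 0, 1]
p_h = p_hidden_given_visible(v, w, c)
p_v = p_visible_given_hidden([1, 0], w, b)
```

Because there are no connections within a layer, each hidden unit's probability can be computed independently given the visible layer, and vice versa.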

The exact derivation of the formulas is not shown here.

A DBN is a probabilistic generative model, in contrast to the discriminative models of traditional neural networks. A generative model builds a joint distribution over observations and labels, so both P(Observation|Label) and P(Label|Observation) can be evaluated, whereas a discriminative model evaluates only the latter, P(Label|Observation).
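The difference can be illustrated with a tiny joint distribution. The table below is made up purely for illustration (the observations, labels and probabilities are hypothetical); the point is only that a joint distribution lets us compute both conditionals.

```python
# Hypothetical joint distribution P(observation, label).
joint = {
    ("dark", "cat"): 0.30, ("dark", "dog"): 0.10,
    ("light", "cat"): 0.20, ("light", "dog"): 0.40,
}

def marginal(joint, index, value):
    """Sum the joint over all entries whose key has `value` at position `index`."""
    return sum(p for key, p in joint.items() if key[index] == value)

def p_obs_given_label(joint, obs, label):
    """P(Observation | Label) = P(obs, label) / P(label)."""
    return joint[(obs, label)] / marginal(joint, 1, label)

def p_label_given_obs(joint, obs, label):
    """P(Label | Observation) = P(obs, label) / P(obs)."""
    return joint[(obs, label)] / marginal(joint, 0, obs)

generative_view = p_obs_given_label(joint, "dark", "cat")      # needs the joint
discriminative_view = p_label_given_obs(joint, "dark", "cat")  # all a discriminative model gives
```

A discriminative model would learn only the second quantity directly, without ever representing the joint table.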

A DBN consists of multiple layers of restricted Boltzmann machines; a typical network of this type is shown in the figure. These networks are “restricted” to one visible layer and one hidden layer, with connections between the layers but not between the units within a layer. The hidden units are trained to capture the higher-order correlations in the data presented at the visible layer.

Generative adversarial networks were already explained in a previous post, so I’ll only cover them briefly here.

The goal of a generative adversarial network is to generate. Our traditional network structures tend to be discriminative models, i.e., they judge whether a sample is genuine. A generative model, on the other hand, can produce new samples similar to the ones it was given; note that these new samples are learned by the computer.

A GAN generally consists of two networks: a generative model network and a discriminative model network.

The generative model G captures the distribution of the sample data: from noise z drawn from some distribution (uniform, Gaussian, etc.), it generates a sample resembling the real training data, and the closer the resemblance, the better. The discriminative model D is a binary classifier that estimates the probability that a sample comes from the training data rather than from the generated data. If the sample comes from the real training data, D outputs a large probability; otherwise, D outputs a small probability.

As an example: the generative network G is like a gang of counterfeiters that specializes in producing fake currency, and the discriminative network D is like the police, who specialize in detecting whether currency is real or counterfeit. G’s goal is to produce currency indistinguishable from the real thing so that D cannot tell the difference, while D’s goal is to detect the counterfeit currency produced by G.
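The two opposing goals can be sketched as two loss functions computed from D's outputs. This is only an illustration under assumptions: it uses binary cross-entropy for D and the commonly used non-saturating loss for G, which may differ from what the cDCGAN post actually used, and the function names and sample values are hypothetical.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: real samples should score near 1, generated near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: G wants D to score generated samples near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
d_real = [0.9, 0.8]  # D's scores on real training samples
d_fake = [0.2, 0.1]  # D's scores on samples produced by G
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

In this example D is doing well, so its loss is small and G's loss is large; as G's counterfeits improve and D's scores on them rise, the generator loss falls and the discriminator loss grows.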

Traditional discriminative network:

Generative adversarial network:

The following shows an example of a cDCGAN (written in an earlier post)

Generative network

Discriminative network

The final result, trained with MNIST as the initial samples: judging by the generated digits, the network has learned quite well.

This article is a very brief introduction to four neural network architectures: CNN, RNN, DBN, and GAN. It does not go into great depth. These four architectures are very common and widely used. Of course, neural networks cannot be fully explained in a few posts; the material here covers some of the basics to help you get started quickly (and show off a little). Later posts will cover deep autoencoders, Hopfield networks, and long short-term memory networks (LSTM).