Introduction to CNNs
Introduction to Convolutional Neural Networks (CNN)
Convolutional neural networks (CNNs) are a highly efficient recognition method developed in recent years that has attracted widespread attention. In the 1960s, while studying neurons in the cat's visual cortex responsible for local sensitivity and orientation selection, Hubel and Wiesel found that this unique network structure could effectively reduce the complexity of a feedback neural network; this insight later led to the proposal of the convolutional neural network (CNN). Today, CNNs have become a research hotspot in many scientific fields, especially pattern classification, where they are widely used because the network avoids complex image pre-processing and can take the original image directly as input.
Generally, the basic structure of a CNN consists of two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, and the neuron extracts the features of that local region; once a local feature has been extracted, its positional relationship to other features is also fixed. The second is the feature mapping layer: each computational layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons on the plane share the same weights. The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, which makes the feature maps shift-invariant. In addition, because neurons on a feature map share weights, the number of free parameters in the network is reduced. Each convolutional layer in a convolutional neural network is immediately followed by a computational layer for local averaging and secondary feature extraction; this two-stage feature extraction structure reduces the feature resolution.
CNNs are primarily used to recognize two-dimensional shapes that are invariant to displacement, scaling, and other forms of distortion. Since the feature detection layers of a CNN learn from the training data, explicit feature extraction is avoided: the network learns the features implicitly. Furthermore, because neurons on the same feature map share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks in which all neurons are connected to one another. With its special structure of local connections and shared weights, the convolutional neural network has unique advantages in speech recognition and image processing. Its layout is closer to that of a real biological neural network, weight sharing reduces the complexity of the network, and, in particular, multi-dimensional input images can be fed directly into the network, which avoids the complexity of data reconstruction during feature extraction and classification.
1. Neural Networks
First, a brief introduction to neural networks; for details, see Resource 1. Each unit of a neural network looks as follows:
The corresponding formula is as follows:
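For a unit with three inputs $x_1, x_2, x_3$, weights $W_i$, bias $b$, and a sigmoid activation $f$ (the standard setup for such a unit; the exact notation here is an assumption), the formula is typically written as

$$h_{W,b}(x) = f\!\left(\sum_{i=1}^{3} W_i x_i + b\right), \qquad f(z) = \frac{1}{1 + e^{-z}}.$$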
This unit is also known as a logistic regression model. When multiple units are combined into a layered structure, a neural network model is formed. The figure below shows a neural network with one hidden layer.
The corresponding formula is as follows:
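Assuming the common example of three inputs and three hidden units, and the same notation as above (again an assumption about the figure), the forward pass is typically written as

$$a^{(2)}_i = f\!\left(\sum_{j=1}^{3} W^{(1)}_{ij} x_j + b^{(1)}_i\right),\; i = 1,2,3, \qquad h_{W,b}(x) = f\!\left(\sum_{i=1}^{3} W^{(2)}_{1i} a^{(2)}_i + b^{(2)}_1\right).$$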
By analogy, the network can be extended to 2, 3, 4, 5, … hidden layers.
The training method of a neural network is similar to that of logistic regression; however, because of its multi-layer structure, the chain rule must also be applied to the nodes of the hidden layers, i.e., gradient descent plus the chain rule, a procedure known as backpropagation. The training algorithm is not covered in this article for the time being.
2 Convolutional Neural Networks
In image processing, images are often represented as vectors of pixels; for example, a 1000×1000 image can be represented as a vector of 1,000,000 values. In the neural network described in the previous section, if the number of hidden units equals the number of inputs, i.e., also 1,000,000, then the number of parameters from the input layer to the hidden layer is 1,000,000 × 1,000,000 = 10^12, which is far too many to train. So, to make the neural network approach practical for image processing, the number of parameters must first be reduced. It is a bit like the Evil-Warding Sword Manual in the wuxia novels: for ordinary people it is frustrating to practice, but once the drastic prerequisite is paid, the swordplay becomes fast and formidable.
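The arithmetic above, spelled out as a few lines of Python (bias terms are ignored for simplicity):

```python
# A fully connected hidden layer on a 1000x1000 image,
# with one hidden unit per input pixel (biases ignored).
input_size = 1000 * 1000            # one value per pixel
hidden_size = 1000 * 1000           # as many hidden units as inputs
fully_connected_params = input_size * hidden_size
print(f"fully connected: {fully_connected_params:,} parameters")   # 1,000,000,000,000 (10^12)
```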
2.1 Local Perception
Convolutional neural networks have two powerful tools for reducing the number of parameters. The first is the local receptive field. It is generally believed that human perception of the outside world proceeds from the local to the global, and the spatial connections in an image are likewise local: nearby pixels are closely related, while distant pixels are only weakly related. Thus, each neuron does not actually need to perceive the entire image; it only needs to perceive a local region, and the global information is then obtained by combining the local information at higher levels. The idea of partial connectivity is also inspired by the structure of the biological visual system: neurons in the visual cortex receive information locally (i.e., they respond only to stimuli in specific regions). This is shown in the figure below: full connections on the left and local connections on the right.
In the right-hand figure above, if each neuron is connected to only 10×10 pixels, the number of weights is 1,000,000 × 100 = 10^8 parameters, one ten-thousandth of the original. And those 10×10 pixel values with their corresponding 10×10 weights are in fact equivalent to a convolution operation.
2.2 Parameter Sharing
But there are still too many parameters, so the second tool comes in: weight sharing. In the local connections above, each neuron has 100 parameters, and there are 1,000,000 neurons in total; if all 1,000,000 neurons share the same 100 parameters, the number of parameters drops to 100.
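Continuing the same arithmetic in Python, the two tricks reduce the count as follows (biases again ignored):

```python
# Local 10x10 connections, then weight sharing
# (same 1000x1000 image and 1,000,000 hidden units as above).
num_neurons = 1000 * 1000
patch = 10 * 10                      # each neuron looks at a 10x10 patch

local_params = num_neurons * patch   # local connections, no sharing
shared_params = patch                # every neuron reuses the same 10x10 weights

print(f"local connectivity: {local_params:,} parameters")   # 100,000,000 (10^8)
print(f"weight sharing: {shared_params} parameters")         # 100
```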
How to understand weight sharing? We can think of these 100 parameters (that is, the convolution operation) as a way of extracting features in a way that is independent of position. The implicit principle is that one part of the image has the same statistical properties as the rest. This means that the features we learn in one part of the image can be used in another part of the image, so we can use the same learned features for all positions in the image.
More concretely, suppose a small patch of a large image, say 8×8, is sampled at random and some features are learned from that small sample. We can then apply the features learned from the 8×8 patch as a detector anywhere in the image. In particular, we can convolve the features learned from the 8×8 patch with the original large image to obtain, for each feature, an activation value at every location of the large image.
The following figure shows a 3×3 convolution kernel convolving a 5×5 image. Each convolution is a form of feature extraction, acting like a sieve that filters out the parts of the image that match its condition (the higher the activation value, the better the match).
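The animation itself cannot be reproduced here; the following NumPy sketch performs the same kind of "valid" convolution of a 3×3 kernel over a 5×5 image (as in most CNN code, this is the unflipped cross-correlation; the image and kernel values are made up for illustration):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` ('valid' mode, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)       # a toy 5x5 "image"
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])            # a toy 3x3 kernel
print(conv2d_valid(image, kernel).shape)  # (3, 3): (5-3+1) x (5-3+1)
```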
2.3 Multiple Convolution Kernels
With only the 100 parameters described above, there is just one 10×10 convolution kernel, and the feature extraction is obviously insufficient. We can therefore add more convolution kernels, for example 32 kernels, which can learn 32 kinds of features. The situation with multiple convolution kernels is shown in the following figure:
In the right part of the figure above, different colors indicate different convolution kernels. Each convolution kernel turns the input image into another image (a feature map). For example, two convolution kernels generate two images, which can be regarded as different channels of a single image, as shown in the figure below. (Note a small error in that figure: w1 should read w0 and w2 should read w1; they will still be referred to as w1 and w2 in the text below.)
The following figure shows a convolution operation over four channels with two convolution kernels, generating two output channels. Note that each of the four input channels has its own set of kernel weights. Ignoring w2 for the moment and looking only at w1: the value at position (i, j) of the feature map produced by w1 is obtained by summing the convolution results at (i, j) over the four channels and then applying the activation function.
So, in the above process of convolving 4 channels to obtain 2 channels, the number of parameters is 4×2×2×2, where 4 is the number of input channels, the first 2 is the number of output channels generated, and the final 2×2 is the size of the convolution kernel.
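A NumPy sketch of this multi-channel case, with 4 input channels, 2 kernels of size 2×2, and random values standing in for real data:

```python
import numpy as np

def multi_channel_conv(x, kernels):
    """x: (channels, H, W); kernels: (out_channels, channels, kh, kw).
    Each output channel sums the per-channel convolutions, as described above
    (an activation function would normally be applied to each sum; omitted here)."""
    out_c, in_c, kh, kw = kernels.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((out_c, oh, ow))
    for o in range(out_c):
        for c in range(in_c):
            for i in range(oh):
                for j in range(ow):
                    out[o, i, j] += np.sum(x[c, i:i + kh, j:j + kw] * kernels[o, c])
    return out

x = np.random.rand(4, 5, 5)                   # 4 input channels
kernels = np.random.rand(2, 4, 2, 2)          # 2 output channels, 2x2 kernels
print(multi_channel_conv(x, kernels).shape)   # (2, 4, 4)
print(kernels.size)                           # 4 * 2 * 2 * 2 = 32 parameters
```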
2.4 Pooling (Down-sampling)
After obtaining features through convolution, the next step is to use these features for classification. In theory, one could use all the extracted features to train a classifier such as a softmax classifier, but this poses a computational challenge. For example, for a 96×96 pixel image, suppose we have learned 400 features, each defined over an 8×8 input; convolving each feature with the image yields a (96 - 8 + 1) × (96 - 8 + 1) = 7921-dimensional convolved feature map. Since there are 400 features, each example yields a convolved feature vector of 89² × 400 = 3,168,400 dimensions. Learning a classifier with more than 3 million input features is unwieldy and prone to over-fitting.
To solve this problem, first recall that we decided to use convolved features because images have a "stationarity" property: a feature that is useful in one region of the image is very likely to be equally useful in another region. Therefore, to describe a large image, a natural idea is to compute summary statistics of the features at different locations; for example, one can compute the mean (or maximum) value of a particular feature over a region of the image. These summary statistics not only have much lower dimensionality (compared with using all the extracted features) but also tend to improve the results (they are less prone to overfitting). This aggregation operation is called pooling, sometimes average pooling or max pooling depending on how the pooled value is computed.
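A minimal NumPy sketch of this pooling step, reusing the 96×96 / 8×8 / 400-feature numbers from the previous paragraph (the 2×2 pooling window is an illustrative choice, not something stated above):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping size x size pooling over a 2-D feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                       # drop ragged edges
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

# Dimensions from the 96x96 example above: 400 features, 8x8 patches.
conv_dim = (96 - 8 + 1) ** 2 * 400                          # 89^2 * 400 = 3,168,400
pooled = pool2d(np.random.rand(89, 89), size=2, mode="mean")
pooled_dim = pooled.size * 400                              # 44^2 * 400 = 774,400
print(conv_dim, pooled_dim)
```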
At this point, the basic structure and principles of convolutional neural networks have been described.
2.5 Multi-layer convolution
In practice, multiple convolutional layers are often used, followed by fully connected layers for training. The motivation is that the features learned by a single convolutional layer tend to be local; the more layers there are, the more global the learned features become.
3 ImageNet-2010 Network Architecture
ImageNet LSVRC is an image classification competition with a training set of more than 1.27 million images, a validation set of 50,000 images, and a test set of 150,000 images. This section presents the CNN structure of Alex Krizhevsky, which won the competition with a top-5 error rate of 15.3% (the ILSVRC-2012 result). It is worth mentioning that in the more recent ImageNet LSVRC 2014 competition, the champion GoogLeNet reached a top-5 error rate of 6.67%, which shows there is still a great deal of room for improvement in deep learning.
The figure below shows Alex's CNN structure. Note that the model uses a two-GPU parallel structure: the parameters of the 1st, 2nd, 4th, and 5th convolutional layers are split into two parts for training. Parallelism here comes in two flavors, data parallelism and model parallelism. In data parallelism, the same model structure runs on different GPUs, but the training data is split and trained separately, and the resulting models are then fused. In model parallelism, the parameters of certain layers are split across GPUs, the same data is used for training on each GPU, and the results are concatenated and fed as input to the next layer.
The basic parameters of the model above are listed below (a code sketch follows the list):
Input: 224×224 sized image, 3 channels
First layer of convolution: 96 convolution kernels of 5×5 size, 48 on each GPU.
First layer of max-pooling: 2×2 kernels.
Second layer of convolution: 256 3×3 convolution kernels, 128 on each GPU.
Second layer max-pooling: 2×2 kernels.
Third convolutional layer: 384 convolution kernels of 3×3, connected to all maps of the previous layer, 192 on each of the two GPUs.
Fourth convolutional layer: 384 convolution kernels of 3×3, 192 on each of the two GPUs. This layer is connected to the previous one without an intervening pooling layer.
Fifth convolutional layer: 256 convolutional kernels of 3×3, 128 on each of the two GPUs.
Fifth layer max-pooling: 2×2 kernels.
First fully-connected layer: 4096 dimensions; the output of the fifth max-pooling layer is flattened into a one-dimensional vector and used as the input to this layer.
Second fully-connected layer: 4096 dimensions.
Softmax layer: the output has 1000 dimensions, each of which is the probability that the image belongs to the corresponding category.
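As a rough illustration only, the sketch below stacks the layers exactly as listed, in PyTorch, as a single stream; the two-GPU split is ignored, and the lack of padding/strides, the absence of normalization, and the use of ReLU activations are assumptions not specified in the list.

```python
import torch
import torch.nn as nn

# Single-stream sketch of the layer list above (two-GPU split ignored).
# Kernel counts/sizes follow the list; ReLU activations are an assumption,
# and no padding or stride tricks are used beyond the stated kernel sizes.
model = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5), nn.ReLU(),       # 224 -> 220
    nn.MaxPool2d(2),                                  # 220 -> 110
    nn.Conv2d(96, 256, kernel_size=3), nn.ReLU(),     # 110 -> 108
    nn.MaxPool2d(2),                                  # 108 -> 54
    nn.Conv2d(256, 384, kernel_size=3), nn.ReLU(),    # 54 -> 52
    nn.Conv2d(384, 384, kernel_size=3), nn.ReLU(),    # 52 -> 50
    nn.Conv2d(384, 256, kernel_size=3), nn.ReLU(),    # 50 -> 48
    nn.MaxPool2d(2),                                  # 48 -> 24
    nn.Flatten(),
    nn.Linear(256 * 24 * 24, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                            # softmax applied by the loss
)

print(model(torch.randn(1, 3, 224, 224)).shape)       # torch.Size([1, 1000])
```

With these assumptions, the spatial size shrinks from 224 to 24 before the fully-connected layers, which is where the 256 × 24 × 24 flattened input in the sketch comes from.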
4 DeepID Network Structure
The DeepID network structure is a convolutional neural network developed by Sun Yi and colleagues at the Chinese University of Hong Kong to learn face features. Each input face is represented as a 160-dimensional vector, and the learned vectors are classified by other models, achieving 97.45% accuracy on the face verification test; furthermore, the original authors improved the CNN and reached 99.15% accuracy.
As shown below, the structure is similar in its specific parameters to the ImageNet network above, so only the differences are explained here.
The structure in the figure above has only one fully connected layer at the end, followed by a softmax layer. This fully connected layer is used as the feature representation of the image in the paper. Its inputs are the outputs of both the fourth convolutional layer and the third max-pooling layer, so that both local and more global features can be learned.
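As a minimal sketch of this idea (the input shapes here are hypothetical placeholders, not the paper's exact dimensions; only the 160-dimensional output follows the text above), the two feature maps can simply be flattened and concatenated before the final fully connected layer:

```python
import torch
import torch.nn as nn

class DeepIDStyleHead(nn.Module):
    """Illustrative only: the 160-d fully connected layer takes both the
    third max-pooling output and the fourth convolution output as input."""
    def __init__(self, pool3_dim=1440, conv4_dim=960, feature_dim=160):
        super().__init__()
        self.fc = nn.Linear(pool3_dim + conv4_dim, feature_dim)

    def forward(self, pool3_out, conv4_out):
        joint = torch.cat([pool3_out.flatten(1), conv4_out.flatten(1)], dim=1)
        return self.fc(joint)

head = DeepIDStyleHead()
feat = head(torch.randn(1, 1440), torch.randn(1, 960))
print(feat.shape)  # torch.Size([1, 160]) -- the face representation
```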
LANL researchers demonstrate the absence of a ‘barren plateau’ in quantum convolutional neural networks
With the advent of quantum computers, a number of different architectures have been proposed that may offer advantages over their classical counterparts. Quantum neural networks (QNNs) are among the most promising, with applications ranging from physics simulation to optimization and more general machine learning tasks. Despite their great potential, QNNs have been shown to exhibit "barren plateaus", in which the gradient of the cost function vanishes exponentially with the size of the system, preventing the architecture from being trained on large problems.
Here, Los Alamos National Laboratory (LANL), in collaboration with researchers at the University of London, demonstrates that a specific QNN architecture does not suffer from a barren plateau.
The researchers analyzed an architecture known as the quantum convolutional neural network (QCNN), which has recently been proposed for solving classification problems on quantum data. For example, QCNNs can be trained to classify quantum states of matter according to the phase to which they belong. The researchers demonstrated that QCNNs are not affected by barren plateaus, highlighting them as a candidate architecture for realizing a quantum advantage in the near term.
The study was published in Physical Review X on October 15, 2021, under the title "Absence of Barren Plateaus in Quantum Convolutional Neural Networks".
QNNs have generated interest around the possibility of efficiently analyzing quantum data. However, this excitement has been tempered by the presence of exponentially vanishing gradients (so-called barren plateau landscapes) in many QNN architectures. More recently, QCNNs have been proposed, involving a series of convolutional and pooling layers that reduce the number of qubits while retaining information about the features of the data.
Schematic of QCNN.
In this work, the researchers rigorously analyzed the scaling of the parameter gradients in the QCNN architecture. It turned out that the variance of the gradient vanishes no faster than polynomially with the system size, meaning that QCNNs do not exhibit a barren plateau. The results provide analytical guarantees for the trainability of randomly initialized QCNNs, which sets them apart from many other QNN architectures.
To arrive at these results, the researchers introduced a new graph-based method for analyzing expectation values over Haar-distributed unitaries, which may be useful in other contexts; in addition, they performed numerical simulations to validate the analysis.
Tensor network representation of QCNN.
As an artificial intelligence method, QCNNs are inspired by the visual cortex. As such, they involve a series of convolutional layers or filters that are interleaved with pooling layers to reduce the dimensionality of the data while maintaining the important features of the data set. These neural networks can be used to solve a range of problems, from image recognition to material discovery. Overcoming the barren plateau is key to unlocking the full potential of quantum computers for AI applications and demonstrating their superiority over classical computers.
Marco Cerezo, one of the paper's co-authors, said that until now researchers in quantum machine learning have analyzed how to mitigate the effects of barren plateaus, but have lacked a theoretical basis for avoiding them altogether. LANL's work demonstrates that some quantum neural networks are in fact immune to barren plateaus.
“With this assurance, researchers will now be able to sift through quantum computer data about quantum systems and use that information for things like studying material properties or discovering new materials.” said Patrick Coles, a quantum physicist at LANL.
The GRIM module of the QCNN architecture.
For more than 40 years, physicists have believed that quantum computers will prove useful for simulating and understanding quantum systems of particles, a task that overwhelms ordinary classical computers, and the LANL study demonstrates that a robust type of quantum convolutional neural network holds promise for analyzing data from quantum simulations.
"The field of quantum machine learning is still in its early stages," Coles said. "There is a famous quote about lasers: when they were first discovered, people said they were a solution in search of a problem. Now lasers are used everywhere. Similarly, a number of us suspect that quantum data will become highly available, and quantum machine learning may take off as well."
For example, according to Coles, research has focused on ceramic materials as high-temperature superconductors, which could improve frictionless transportation, such as magnetic levitation trains. But analyzing data on the large number of phases in a material that are affected by temperature, pressure and impurities, and classifying the phases is a daunting task beyond the capabilities of classical computers. Using scalable quantum neural networks, quantum computers can sift through large datasets about various states of a given material and correlate these states with phases to determine the optimal state for high-temperature superconductivity.
The paper’s author, Arthur Pesah, said, “As the QNN field flourishes, we believe it is important to perform similar analyses on other candidate architectures, and the techniques developed in our work can be used as a blueprint for such analyses.”
Link to paper: https://journals.aps.org/prx/abstract/10.1103/PhysRevX.11.041011
Related story: https://phys.org/news/2021-10-breakthrough-proof-path-quantum-ai.html