### Common Convolutional Network Training Time

Learning convolutional neural networks commonly takes three to six months of study. Convolutional neural networks (CNNs) are a class of feedforward neural networks that perform convolution computations and have a deep structure, and they are one of the representative algorithms of deep learning. CNNs are capable of representation learning and can classify input information in a shift-invariant way according to their hierarchical structure, which is why they are also known as shift-invariant artificial neural networks. With the development of deep learning theory and improvements in numerical computing hardware, CNNs have advanced rapidly and have been applied to computer vision, natural language processing, and other fields. According to publicly available information, the theory behind convolutional neural networks is complex and hard to learn, so the training (study) time is usually three to six months.

### Convolutional Neural Networks

Notes on the convolutional networks chapter of the "flower book" (*Deep Learning* by Goodfellow, Bengio, and Courville) are recorded at https://www.jianshu.com/p/5a3c90ea0807.

A convolutional neural network (CNN or ConvNet) is a deep feedforward neural network with properties such as local connectivity and weight sharing. Convolutional neural networks were inspired by the biological receptive field mechanism. A receptive field refers to a property of some neurons in the auditory, visual, and other nervous systems: a neuron only receives signals from the stimulus region it innervates.

Convolutional neural networks were first used mainly to process image information. When fully-connected feedforward networks are used to process images, there are two problems:

Current convolutional neural networks are generally feedforward neural networks built by interleaving convolutional, pooling, and fully connected layers, and they are trained with the back-propagation algorithm. Convolutional neural networks have three structural properties: local connectivity, weight sharing, and pooling. These properties give the network a degree of invariance to translation, scaling, and rotation.

Convolution is an important operation in mathematical analysis. In signal processing and image processing, one- or two-dimensional convolution is often used.

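One-dimensional convolution is often used in signal processing to calculate the delayed accumulation of a signal. Suppose that a signal generator produces a signal $x_t$ at each time $t$, and that the information decays at rate $w_k$, i.e., after $k-1$ time steps the information is $w_k$ times its original value. Assuming, for example, $w_1 = 1$, $w_2 = 1/2$, $w_3 = 1/4$, the signal $y_t$ received at time $t$ is the superposition of the information generated at the current time and the delayed information from previous times:

$$y_t = 1 \times x_t + \frac{1}{2} \times x_{t-1} + \frac{1}{4} \times x_{t-2} = \sum_{k=1}^{3} w_k x_{t-k+1}$$

We refer to $w_1, w_2, \dots$ as a filter or convolution kernel. Assuming that the filter has length $K$, its convolution with a signal sequence $x_1, x_2, \dots$ is:

$$y_t = \sum_{k=1}^{K} w_k x_{t-k+1}$$

The convolution of a signal sequence $\mathbf{x}$ and a filter $\mathbf{w}$ is written as $\mathbf{y} = \mathbf{w} * \mathbf{x}$, where $*$ denotes the convolution operation.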

In general, the filter is much shorter than the signal sequence. The following figure gives an example of a one-dimensional convolution:
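As a minimal sketch of one-dimensional convolution, the snippet below computes the delayed accumulation directly and checks it against NumPy's `np.convolve`. The function name `conv1d`, the signal values, and the decay weights are assumptions made for illustration; only positions where the filter fully overlaps the signal are kept.

```python
import numpy as np

def conv1d(x, w):
    """Narrow 1-D convolution: y_t = sum_k w_k * x_{t-k+1} (1-based indices).

    Only positions where the filter fully overlaps the signal are kept,
    so the output has len(x) - len(w) + 1 entries.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    K = len(w)
    y = np.empty(len(x) - K + 1)
    for t in range(K - 1, len(x)):  # t: 0-based index of the newest sample
        # w[0] weights the newest sample, w[K-1] the oldest one
        y[t - K + 1] = sum(w[k] * x[t - k] for k in range(K))
    return y

x = [1, 1, -1, 1, 1, 1, -1, 1, 1]  # illustrative signal (assumed values)
w = [1, 0.5, 0.25]                 # illustrative decay weights (assumed values)
y = conv1d(x, w)

# NumPy's built-in convolution in "valid" mode computes the same quantity.
assert np.allclose(y, np.convolve(x, w, mode="valid"))
```

The explicit loop mirrors the sum over delayed samples; in practice `np.convolve` would normally be used instead.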

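Two-dimensional convolution is often used in image processing. Because an image has a two-dimensional structure, one-dimensional convolution needs to be extended. Given an image $X \in \mathbb{R}^{M \times N}$ and a filter $W \in \mathbb{R}^{U \times V}$, generally with $U \ll M$ and $V \ll N$, the convolution is:

$$y_{ij} = \sum_{u=1}^{U} \sum_{v=1}^{V} w_{uv}\, x_{i-u+1,\, j-v+1}$$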

The following figure gives an example of a two-dimensional convolution:

Note that the convolution operation here does not simply place a kernel-sized window on the image, multiply the pixel values by the corresponding kernel elements, and sum the products. Instead, the kernel is first rotated by 180 degrees, and then that window operation is applied.

In image processing, convolution is often used as an effective method for feature extraction. The image obtained after a convolution operation is called a feature map.
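The role of the 180-degree rotation can be sketched in a few lines of NumPy. This is an illustrative implementation, not an optimized one; the name `conv2d` and the example arrays are assumptions.

```python
import numpy as np

def conv2d(X, W):
    """Narrow 2-D convolution: rotate the kernel by 180 degrees, then slide it."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    U, V = W.shape
    W_rot = W[::-1, ::-1]  # 180-degree rotation of the kernel
    out = np.empty((X.shape[0] - U + 1, X.shape[1] - V + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the rotated kernel with the window, then sum
            out[i, j] = np.sum(W_rot * X[i:i + U, j:j + V])
    return out

X = np.arange(16).reshape(4, 4)    # illustrative 4x4 "image"
W = np.array([[1, 0], [0, -1]])    # illustrative 2x2 kernel
Y = conv2d(X, W)                   # every entry is 5 for this input
```

For this input, skipping the rotation would give -5 everywhere instead of 5, which shows that the flip changes the result whenever the kernel is not symmetric under rotation.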

The top filter is a Gaussian filter, which can be used to smooth and denoise the image; the middle and bottom filters can be used to extract edge features.

In machine learning and image processing, the main function of convolution is to slide a convolution kernel (i.e., a filter) over an image (or some other feature) and obtain a new set of features through the convolution operation. Computing a convolution requires flipping the kernel (i.e., the 180-degree rotation mentioned above). In concrete implementations, convolution is generally replaced by the cross-correlation operation, which avoids some unnecessary operations and overhead.

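Cross-correlation is a function that measures the correlation of two sequences and is usually implemented as a dot product over a sliding window. Given an image $X \in \mathbb{R}^{M \times N}$ and a convolution kernel $W \in \mathbb{R}^{U \times V}$, their cross-correlation is:

$$y_{ij} = \sum_{u=1}^{U} \sum_{v=1}^{V} w_{uv}\, x_{i+u-1,\, j+v-1}$$

Written with $\otimes$ for cross-correlation and $*$ for convolution, this is $Y = W \otimes X = \mathrm{rot180}(W) * X$, where $\mathrm{rot180}(\cdot)$ denotes rotation by 180 degrees.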

The only difference between cross-correlation and convolution is whether the convolution kernel is flipped. Cross-correlation can therefore also be called non-flipped convolution. When the convolution kernel is a learnable parameter, convolution and cross-correlation are equivalent. Therefore, for convenience of implementation (or description), we use cross-correlation instead of convolution. In fact, many of the convolution operations in deep learning tools are actually cross-correlation operations.

Building on the standard definition of convolution, a filter stride and zero padding can also be introduced to increase the variety of convolutions and make feature extraction more flexible.
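The relationship between the two operations can be checked numerically. The sketch below is illustrative only: `cross_correlate2d` and `rot180` are assumed helper names, and the inputs are random arrays.

```python
import numpy as np

def cross_correlate2d(X, W):
    """Sliding-window dot product without flipping the kernel."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    U, V = W.shape
    out = np.empty((X.shape[0] - U + 1, X.shape[1] - V + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * X[i:i + U, j:j + V])
    return out

def rot180(W):
    """Rotate a 2-D kernel by 180 degrees."""
    return np.asarray(W)[::-1, ::-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 5))
W = rng.normal(size=(3, 3))

# Convolution with W is cross-correlation with the rotated kernel.
conv = cross_correlate2d(X, rot180(W))

# For a kernel that is symmetric under a 180-degree rotation,
# convolution and cross-correlation give identical results.
W_sym = W + rot180(W)
assert np.allclose(cross_correlate2d(X, W_sym),
                   cross_correlate2d(X, rot180(W_sym)))
```

Because a learned kernel can absorb the flip, a network trained with cross-correlation simply learns the rotated version of the kernel it would have learned with true convolution.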

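The stride of a filter is the step interval at which the filter slides over the input.

Zero padding pads both ends of the input vector with zeros.

Suppose that a convolutional layer has $M$ input neurons, a kernel of size $K$, a stride of $S$, and $P$ zeros padded at each end of the input. The layer then has $(M - K + 2P)/S + 1$ neurons.

Three types of convolution are in common use: narrow convolution (stride $S = 1$, no zero padding, $P = 0$, output length $M - K + 1$); wide convolution ($S = 1$, $P = K - 1$, output length $M + K - 1$); and equal-width convolution ($S = 1$, $P = (K - 1)/2$, output length $M$).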
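As a small sketch, the output length for a given stride and zero padding can be computed with the standard formula $(M - K + 2P)/S + 1$, where $M$ is the number of inputs, $K$ the kernel size, $S$ the stride, and $P$ the padding at each end. The helper name `conv_output_length` and the example sizes are assumptions; the three cases use the usual names narrow, wide, and equal-width convolution.

```python
def conv_output_length(M, K, S=1, P=0):
    """Number of output neurons for M inputs, kernel size K, stride S, padding P."""
    if (M - K + 2 * P) % S != 0:
        # Design choice for this sketch: reject strides that do not tile evenly.
        raise ValueError("stride does not evenly tile the padded input")
    return (M - K + 2 * P) // S + 1

M, K = 10, 3  # illustrative sizes
narrow = conv_output_length(M, K, S=1, P=0)            # M - K + 1 = 8
wide = conv_output_length(M, K, S=1, P=K - 1)          # M + K - 1 = 12
equal = conv_output_length(M, K, S=1, P=(K - 1) // 2)  # M = 10 (K odd)
```

Raising an error on uneven strides is only one option; many libraries instead round down and silently drop the leftover inputs.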

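Because convolutional networks are also trained with the back-propagation algorithm, we focus here on the derivative properties of convolution.

Assume $Y = W \otimes X$, where $\otimes$ denotes cross-correlation, $X \in \mathbb{R}^{M \times N}$, $W \in \mathbb{R}^{U \times V}$, and $Y \in \mathbb{R}^{(M-U+1) \times (N-V+1)}$, and let $f(Y) \in \mathbb{R}$ be a scalar function.

Then, since $y_{ij} = \sum_{u=1}^{U} \sum_{v=1}^{V} w_{uv}\, x_{i+u-1,\, j+v-1}$, we have:

$$\frac{\partial f(Y)}{\partial w_{uv}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} \frac{\partial y_{ij}}{\partial w_{uv}} \frac{\partial f(Y)}{\partial y_{ij}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} x_{i+u-1,\, j+v-1} \frac{\partial f(Y)}{\partial y_{ij}}$$

It can be seen that the partial derivative of $f(Y)$ with respect to $W$ is the convolution of $X$ and $\frac{\partial f(Y)}{\partial Y}$:

$$\frac{\partial f(Y)}{\partial W} = \frac{\partial f(Y)}{\partial Y} \otimes X$$

Similarly:

$$\frac{\partial f(Y)}{\partial x_{st}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} \frac{\partial y_{ij}}{\partial x_{st}} \frac{\partial f(Y)}{\partial y_{ij}} = \sum_{i=1}^{M-U+1} \sum_{j=1}^{N-V+1} w_{s-i+1,\, t-j+1} \frac{\partial f(Y)}{\partial y_{ij}}$$

where $w_{s-i+1,\, t-j+1} = 0$ when $(s-i+1) < 1$, or $(s-i+1) > U$, or $(t-j+1) < 1$, or $(t-j+1) > V$; i.e., it is equivalent to zero-padding $W$ with $P = (M-U, N-V)$. The partial derivative of $f(Y)$ with respect to $X$ is therefore the wide convolution of $W$ and $\frac{\partial f(Y)}{\partial Y}$.

Expressed in terms of the cross-correlation form of "convolution", this is (note that wide convolution is commutative):

$$\frac{\partial f(Y)}{\partial X} = \mathrm{rot180}\left(\frac{\partial f(Y)}{\partial Y}\right) \tilde{\otimes}\, W = \mathrm{rot180}(W)\, \tilde{\otimes}\, \frac{\partial f(Y)}{\partial Y}$$

where $\tilde{\otimes}$ denotes wide cross-correlation and $\mathrm{rot180}(\cdot)$ denotes rotation by 180 degrees.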
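The two derivative identities can be sanity-checked numerically. The sketch below assumes the cross-correlation form $Y = W \otimes X$ with a 2-D input `X` and kernel `W`. The helper names, the random inputs, and the test function `f` (chosen so that the gradient of `f` with respect to `Y` is a fixed array `G`) are all assumptions made for illustration.

```python
import numpy as np

def xcorr2d(X, W):
    """Narrow cross-correlation (no kernel flip)."""
    U, V = W.shape
    out = np.empty((X.shape[0] - U + 1, X.shape[1] - V + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * X[i:i + U, j:j + V])
    return out

def wide_xcorr2d(X, W):
    """Wide cross-correlation: zero-pad X by (U-1, V-1) on every side first."""
    U, V = W.shape
    return xcorr2d(np.pad(X, ((U - 1, U - 1), (V - 1, V - 1))), W)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))
W = rng.normal(size=(3, 2))
G = rng.normal(size=(3, 5))  # stands in for df/dY; same shape as Y

def f(X, W):
    # Scalar test function chosen so that df/dY is exactly G.
    return np.sum(G * xcorr2d(X, W))

grad_W = xcorr2d(X, G)                   # df/dW: cross-correlate X with df/dY
grad_X = wide_xcorr2d(G, W[::-1, ::-1])  # df/dX: rot180(W) wide-correlated with df/dY

def numeric_grad(fun, A, eps=1e-6):
    """Central-difference gradient of a scalar function of an array."""
    g = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        d = np.zeros_like(A)
        d[idx] = eps
        g[idx] = (fun(A + d) - fun(A - d)) / (2 * eps)
    return g

assert np.allclose(grad_W, numeric_grad(lambda W_: f(X, W_), W), atol=1e-5)
assert np.allclose(grad_X, numeric_grad(lambda X_: f(X_, W), X), atol=1e-5)
```

The shapes work out as expected: `grad_W` has the shape of `W`, and `grad_X` has the shape of `X`.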

In a fully connected feedforward neural network, if layer $l$ has $n^{(l)}$ neurons and layer $l-1$ has $n^{(l-1)}$ neurons, there are $n^{(l)} \times n^{(l-1)}$ connecting edges, i.e., the weight matrix has $n^{(l)} \times n^{(l-1)}$ parameters. When $n^{(l)}$ and $n^{(l-1)}$ are both large, the weight matrix has a very large number of parameters, and training can be very inefficient.

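If convolution is used instead of full connectivity, the net input $z^{(l)}$ of layer $l$ is the convolution of the activations $a^{(l-1)}$ of layer $l-1$ with a filter $w^{(l)} \in \mathbb{R}^{m}$, i.e.

$$z^{(l)} = w^{(l)} \otimes a^{(l-1)} + b^{(l)}$$

where the filter $w^{(l)}$ is a learnable weight vector and $b^{(l)} \in \mathbb{R}$ is a learnable bias.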

Based on the definition of convolution, convolutional layers have two very important properties: local connectivity (each neuron in layer $l$ is connected only to neurons within a local window of layer $l-1$) and weight sharing (the same filter $w^{(l)}$ is used by all neurons in layer $l$).

Because of local connectivity and weight sharing, the parameters of the convolutional layer are only an $m$-dimensional weight and a 1-dimensional bias, for a total of $m + 1$ parameters. The number of parameters is independent of the number of neurons. In addition, the number of neurons in layer $l$ is not chosen arbitrarily but satisfies $n^{(l)} = n^{(l-1)} - m + 1$.
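To make the difference in parameter counts concrete, the sketch below compares a fully connected layer with a one-dimensional convolutional layer that shares a single filter. The function names and the layer sizes are hypothetical, chosen only for illustration; the fully connected count covers the weight matrix only, as in the text above.

```python
def fc_weight_count(n_in, n_out):
    """Fully connected layer: one weight per edge between the two layers."""
    return n_in * n_out

def conv1d_param_count(m):
    """1-D convolutional layer: one shared filter of size m plus one bias."""
    return m + 1

n_in, m = 1000, 3      # hypothetical input size and filter size
n_out = n_in - m + 1   # neurons in the convolutional layer (narrow convolution)

fc = fc_weight_count(n_in, n_out)  # 998000 weights
conv = conv1d_param_count(m)       # 4 parameters, independent of n_in
```

The convolutional count stays at `m + 1` however large the input grows, which is the point of weight sharing.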

The role of the convolutional layer is to extract features from a local region, and different convolutional kernels correspond to different feature extractors.

A feature map is the set of features extracted from an image (or from another feature map) by convolution; each feature map can be regarded as one class of extracted image features. To improve the representational capacity of a convolutional network, multiple different feature maps can be used at each layer to better represent the features of the image.

In the input layer, the feature maps are the image itself. A grayscale image has one feature map, so the input depth is $D = 1$; a color image has one feature map for each of the three RGB color channels, so the input depth is $D = 3$.

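Without loss of generality, suppose a convolutional layer is structured as follows: the input feature maps form a tensor $X \in \mathbb{R}^{M \times N \times D}$, where each slice $X^{d} \in \mathbb{R}^{M \times N}$ is one input feature map, $1 \le d \le D$; the output feature maps form a tensor $Y \in \mathbb{R}^{M' \times N' \times P}$, where each slice $Y^{p} \in \mathbb{R}^{M' \times N'}$ is one output feature map, $1 \le p \le P$; and the convolution kernels form a tensor $W \in \mathbb{R}^{U \times V \times P \times D}$, where each slice $W^{p,d} \in \mathbb{R}^{U \times V}$ is a two-dimensional kernel.

To compute the output feature map $Y^{p}$, the kernels $W^{p,1}, \dots, W^{p,D}$ are convolved with the input feature maps $X^{1}, \dots, X^{D}$ respectively, the results are summed, and a scalar bias $b^{p}$ is added to obtain the net input $Z^{p}$ of the convolutional layer. The output feature map is then obtained by applying a nonlinear activation function $f$:

$$Z^{p} = \sum_{d=1}^{D} W^{p,d} \otimes X^{d} + b^{p}, \qquad Y^{p} = f(Z^{p})$$

In a convolutional layer with $D$ input feature maps and $P$ output feature maps, each output feature map requires $D$ filters and one bias. Assuming each filter has size $U \times V$, a total of $P \times D \times (U \times V) + P$ parameters are required.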
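A minimal NumPy sketch of such a layer is shown below, assuming `D` input feature maps, `P` output feature maps, and kernels of size `U x V`. The array layout (`(D, M, N)` for inputs, `(P, D, U, V)` for kernels), the function names, and the choice of ReLU as the activation are assumptions for illustration, not something fixed by the text.

```python
import numpy as np

def xcorr2d(X, W):
    """Narrow cross-correlation of one feature map with one 2-D kernel."""
    U, V = W.shape
    out = np.empty((X.shape[0] - U + 1, X.shape[1] - V + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * X[i:i + U, j:j + V])
    return out

def conv_layer_forward(X, W, b, f=lambda z: np.maximum(z, 0.0)):
    """X: (D, M, N) input maps; W: (P, D, U, V) kernels; b: (P,) biases."""
    P, D, U, V = W.shape
    # Net input of each output map: sum over input maps, plus a scalar bias.
    Z = np.stack([
        sum(xcorr2d(X[d], W[p, d]) for d in range(D)) + b[p]
        for p in range(P)
    ])
    return f(Z)  # apply the nonlinear activation (ReLU assumed here)

def conv_layer_param_count(P, D, U, V):
    """P * D kernels of size U x V, plus one bias per output map."""
    return P * D * U * V + P

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8, 8))     # e.g. an 8x8 RGB input, D = 3
W = rng.normal(size=(4, 3, 3, 3))  # P = 4 output maps, 3x3 kernels
b = np.zeros(4)
Y = conv_layer_forward(X, W, b)    # shape (4, 6, 6)
```

With these sizes the layer holds `4 * 3 * 3 * 3 + 4 = 112` parameters, which equals `W.size + b.size`.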

The Pooling Layer, also known as the Subsampling Layer, is used to perform feature selection, reducing the number of features and thus the number of parameters.

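Two pooling functions are commonly used. Max pooling takes the maximum activation of all neurons in a region $R^{d}_{m,n}$, and mean pooling takes their average:

$$y^{d}_{m,n} = \max_{i \in R^{d}_{m,n}} x_i, \qquad y^{d}_{m,n} = \frac{1}{|R^{d}_{m,n}|} \sum_{i \in R^{d}_{m,n}} x_i$$

where $x_i$ is the activation value of each neuron in the region $R^{d}_{m,n}$.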

It can be seen that the pooling layer not only effectively reduces the number of neurons, but also makes the network invariant to some small local morphological changes and gives it a larger receptive field.

A typical pooling layer divides each feature map into non-overlapping regions of size $2 \times 2$ and then downsamples them using max pooling. A pooling layer can also be viewed as a special convolutional layer with kernel size $K \times K$ and stride $S \times S$, whose kernel is the max function or the mean function. Too large a sampling region sharply reduces the number of neurons and can cause too much information loss.

A typical convolutional network is an interleaved stack of convolutional, pooling, and fully connected layers.
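Non-overlapping max and mean pooling can be sketched with a reshape, as below. The function name `pool2d`, its parameters, and the example feature map are assumptions for illustration.

```python
import numpy as np

def pool2d(X, size=2, mode="max"):
    """Non-overlapping pooling over size x size regions of a 2-D feature map."""
    M, N = X.shape
    if M % size or N % size:
        raise ValueError("feature map must divide evenly into pooling regions")
    # After the reshape, axes 1 and 3 index positions inside each region.
    blocks = X.reshape(M // size, size, N // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "mean":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'mean'")

X = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [0, 0, 1, 1],
              [0, 9, 1, 1]], dtype=float)

Y_max = pool2d(X, mode="max")    # [[4, 8], [9, 1]]
Y_mean = pool2d(X, mode="mean")  # [[2.5, 6.5], [2.25, 1.0]]
```

Each 4x4 map shrinks to 2x2, so the number of neurons drops by a factor of four while the strongest (or average) response of each region is kept.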

The commonly used convolutional network structure is shown in the figure. A convolutional block consists of $M$ consecutive convolutional layers and $b$ pooling layers ($M$ is usually set to 2 to 5, and $b$ is 0 or 1). A convolutional network can stack $N$ consecutive convolutional blocks, followed by $K$ fully connected layers ($N$ can vary over a wide range, such as 1 to 100 or larger; $K$ is usually 0 to 2).

Currently, the trend across the whole network structure is toward smaller convolution kernels (e.g., $1 \times 1$ and $3 \times 3$) and deeper structures (e.g., more than 50 layers). In addition, as the convolution operation has become more flexible (e.g., with different strides), pooling layers have become less important, so the proportion of pooling layers in popular convolutional networks is gradually decreasing, tending toward all-convolutional networks.

In fully connected feedforward neural networks, the gradient is mainly back-propagated through the error term of each layer, from which the gradient of each layer's parameters is then computed. A convolutional neural network has two main kinds of layers with different functions: convolutional layers and pooling layers. Since the only parameters are the convolution kernels and the biases, only the gradients of the parameters in the convolutional layers need to be computed.

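Without loss of generality, suppose layer $l$ is a convolutional layer and the input feature maps from layer $l-1$ are $X^{(l-1)} \in \mathbb{R}^{M \times N \times D}$. The net input of the feature maps of layer $l$, $Z^{(l)} \in \mathbb{R}^{M' \times N' \times P}$, is obtained through convolution, and the net input of the $p$-th feature map of layer $l$ is

$$Z^{(l,p)} = \sum_{d=1}^{D} W^{(l,p,d)} \otimes X^{(l-1,d)} + b^{(l,p)}$$

where $W^{(l,p,d)}$ and $b^{(l,p)}$ are the convolution kernels and the bias. Using the derivative properties of convolution, the partial derivative of the loss function $\mathcal{L}$ with respect to the kernel $W^{(l,p,d)}$ is:

$$\frac{\partial \mathcal{L}}{\partial W^{(l,p,d)}} = \frac{\partial \mathcal{L}}{\partial Z^{(l,p)}} \otimes X^{(l-1,d)} = \delta^{(l,p)} \otimes X^{(l-1,d)}$$

where $\delta^{(l,p)} = \frac{\partial \mathcal{L}}{\partial Z^{(l,p)}}$ is the partial derivative of the loss function with respect to the net input of the $p$-th feature map of layer $l$, i.e., its error term.

By the same reasoning, the partial derivative of the loss function with respect to the $p$-th bias $b^{(l,p)}$ of layer $l$ is:

$$\frac{\partial \mathcal{L}}{\partial b^{(l,p)}} = \sum_{i,j} \left[ \delta^{(l,p)} \right]_{i,j}$$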

In a convolutional network, the gradient of each layer’s parameters depends on the error term of the layer in which it is placed.

The error terms are calculated differently for convolutional and pooling layers, so we calculate them separately.

When layer $l+1$ is a pooling layer, the error term $\delta^{(l,p)}$ of the $p$-th feature map in layer $l$ is derived as follows:

$$\delta^{(l,p)} = \frac{\partial \mathcal{L}}{\partial Z^{(l,p)}} = f_l'\left(Z^{(l,p)}\right) \odot \mathrm{up}\left(\delta^{(l+1,p)}\right)$$

where $f_l'(\cdot)$ is the derivative of the activation function used in layer $l$, $\odot$ is the elementwise product, and $\mathrm{up}$ is the upsampling function, which is the opposite of the downsampling operation used in the pooling layer. If the downsampling is max pooling, each value of the error term $\delta^{(l+1,p)}$ is passed directly to the neuron that holds the maximum value in the corresponding region of the previous layer, and the error terms of the other neurons in the region are set to 0. If the downsampling is mean pooling, each value of the error term $\delta^{(l+1,p)}$ is distributed equally to all neurons in the corresponding region of the previous layer.
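The two upsampling rules described above can be sketched as follows. The function names `up_mean` and `up_max`, the fixed non-overlapping region size, and the tie-breaking rule for max pooling (first maximum wins) are assumptions for illustration. The result would still need to be multiplied elementwise by the activation derivative to obtain the error term.

```python
import numpy as np

def up_mean(delta, size=2):
    """Mean pooling: spread each error value equally over its region."""
    return np.kron(delta, np.ones((size, size))) / (size * size)

def up_max(delta, X, size=2):
    """Max pooling: route each error value to the region's maximum; others get 0.

    X is the input of the pooling layer (the previous layer's activations).
    Ties are resolved by taking the first maximum, an arbitrary choice.
    """
    out = np.zeros_like(X, dtype=float)
    for m in range(delta.shape[0]):
        for n in range(delta.shape[1]):
            region = X[m * size:(m + 1) * size, n * size:(n + 1) * size]
            i, j = np.unravel_index(np.argmax(region), region.shape)
            out[m * size + i, n * size + j] = delta[m, n]
    return out

X = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [0, 0, 1, 1],
              [0, 9, 1, 1]], dtype=float)   # activations before pooling
delta = np.array([[1.0, 2.0], [3.0, 4.0]])  # error terms after pooling

d_max = up_max(delta, X)   # nonzero only at the four region maxima
d_mean = up_mean(delta)    # each region filled with delta / 4
```

Both rules preserve the total error: the entries of `d_max` and of `d_mean` each sum to `delta.sum()`.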

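When layer $l+1$ is a convolutional layer with net input $Z^{(l+1,p)} = \sum_{d=1}^{D} W^{(l+1,p,d)} \otimes X^{(l,d)} + b^{(l+1,p)}$, the error term $\delta^{(l,d)}$ of the $d$-th feature map of layer $l$ is derived as follows:

$$\delta^{(l,d)} = \frac{\partial \mathcal{L}}{\partial Z^{(l,d)}} = f_l'\left(Z^{(l,d)}\right) \odot \sum_{p=1}^{P} \left( \mathrm{rot180}\left(W^{(l+1,p,d)}\right) \tilde{\otimes}\, \delta^{(l+1,p)} \right)$$

where $\tilde{\otimes}$ is wide convolution.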

Although LeNet-5 was proposed early, it is a very successful neural network model. A handwritten digit recognition system based on LeNet-5 was used by many banks in the United States in the 1990s to recognize handwritten digits on checks. The network structure of LeNet-5 is shown in the figure:

Excluding the input layer, LeNet-5 has seven layers, and each layer is structured as follows:

AlexNet was the first modern deep convolutional network model. It was the first to use many of the techniques of modern deep convolutional networks, such as ReLU as the nonlinear activation function, Dropout to prevent overfitting, and data augmentation to improve model accuracy. AlexNet won the 2012 ImageNet image classification competition.

The structure of AlexNet is shown in the figure, including five convolutional layers, three fully connected layers and one softmax layer. Because the size of the network exceeded the memory limitations of a single GPU at the time, AlexNet split the network in half and placed it on two separate GPUs, which communicated with each other only on certain layers (such as layer 3).

The exact structure of AlexNet is as follows:

In a convolutional network, how to set the kernel size of a convolutional layer is a critical issue. In Inception networks, a convolutional layer contains multiple convolution operations of different sizes; such a layer is called an Inception module. Inception networks are built by stacking multiple Inception modules together with a small number of pooling layers.

In the v1 version of the Inception module, four parallel feature extraction paths are used: 1×1, 3×3, and 5×5 convolutions and 3×3 max pooling. To improve computational efficiency and reduce the number of parameters, the Inception module performs a 1×1 convolution before the 3×3 and 5×5 convolutions and after the 3×3 max pooling, in order to reduce the depth of the feature maps. If there is redundant information among the input feature maps, the 1×1 convolution acts as a preliminary feature extraction step.
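The parameter saving from a 1×1 reduction can be illustrated with a rough count. The depths below (256 input maps, 64 reduced maps, 128 output maps) are hypothetical values chosen only for illustration and are not taken from any particular Inception configuration; `conv_params` is an assumed helper name.

```python
def conv_params(d_in, d_out, k):
    """Weights plus biases of a k x k convolution from d_in to d_out feature maps."""
    return d_out * d_in * k * k + d_out

# Hypothetical depths, chosen only for illustration.
d_in, d_mid, d_out = 256, 64, 128

# A 5x5 convolution applied directly to all input feature maps.
direct = conv_params(d_in, d_out, 5)

# A 1x1 convolution reduces the depth first, then the 5x5 convolution runs on fewer maps.
reduced = conv_params(d_in, d_mid, 1) + conv_params(d_mid, d_out, 5)
```

With these assumed depths, `direct` is 819328 parameters and `reduced` is 221376, so the path with the 1×1 reduction needs roughly a quarter of the parameters.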