How to count the number of convolutional neural network iterations

Convolutional neural network parameter analysis

(1) Phenomenon:

(1-1) A batch of samples is fed to the neural network at once for forward propagation, and then the weights are adjusted. Processing one batch of batch-size samples in this way is one iteration; a round (epoch) is one pass over the entire training set.

(1-2) The training data is split into chunks, each forming a batch (batch training). In batch training, the loss functions of the individual training samples in the batch are summed, gradient descent is used to minimize that sum, and the parameters of the neural network are updated accordingly.

(2) One iteration consists of forward propagation to compute the output vector, computing the loss between the output vector and the label, and backward propagation to find the derivative of the loss with respect to the weight vector w (as required by gradient descent), followed by the update of the weight vector w.

(3) Advantages:

(a) The gradient vector (the derivative of the cost function with respect to the weight vector w) is estimated accurately, which ensures the fastest descent toward a local minimum; gradient descent proceeds one batch at a time;

(b) Parallel operation of the learning process;

(c) The behavior of the algorithm is closer to stochastic gradient descent;

(d) Batch Normalization uses the mean and variance of the same batch to normalize the data, which accelerates training and sometimes improves accuracy [7].

(4) Practical engineering constraint: computer storage is limited, so the size of the batch that can be loaded at one time is bounded by memory.

(5) Choosing the batch size:

(5-1) From the point of view of convergence speed, small batches of samples are optimal; this is what we call mini-batch. Batch sizes then tend to range from a few tens to a few hundred, and generally not more than a few thousand.

(5-2) GPUs perform better with batch sizes that are powers of 2, so setting the batch size to 16, 32, 64, 128… tends to perform better than setting it to a multiple of 10 or 100.

(6) Four ways to accelerate batch gradient descent [8]:

(6-1) Use momentum: use the gradient to change the velocity of the weights rather than their position.

(6-2) Use different learning rates for different weight parameters.

(6-3) RMSProp: a mean-square refinement of Rprop. Rprop uses only the sign of the gradient, and RMSProp is an averaged version of it for mini-batches.

(6-4) Optimization methods that use curvature information.
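The iteration described in (2) and the momentum idea in (6-1) can be sketched together as follows. This is only a minimal illustration, not the method of any particular library: the linear model y = w*x + b, the mean squared error loss, the toy data, and every name and default value (`minibatch_sgd`, `lr`, `momentum`, and so on) are assumptions made for the example.

```python
import random

def minibatch_sgd(xs, ys, batch_size=4, lr=0.05, momentum=0.9, epochs=200, seed=0):
    """Fit y = w*x + b with mini-batch gradient descent plus a momentum term."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0                      # velocities used by the momentum term
    indices = list(range(len(xs)))
    iterations = 0
    for _ in range(epochs):                # one epoch = one pass over all samples
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            gw = gb = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]   # forward pass and error for one sample
                gw += 2 * err * xs[i]           # d(err^2)/dw
                gb += 2 * err                   # d(err^2)/db
            gw /= len(batch)
            gb /= len(batch)
            vw = momentum * vw - lr * gw        # the gradient changes the velocity...
            vb = momentum * vb - lr * gb
            w += vw                             # ...and the velocity changes the weights
            b += vb
            iterations += 1                     # one batch processed = one iteration
    return w, b, iterations

# Toy data (assumed): 20 noise-free points on the line y = 2x + 1.
xs = [i / 10 for i in range(-10, 10)]
ys = [2 * x + 1 for x in xs]
w, b, iterations = minibatch_sgd(xs, ys)
print(w, b, iterations)
```

With 20 samples and a batch size of 4, each epoch takes 5 iterations, so 200 epochs give 1000 weight updates.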

(1) Definition: when gradient descent is used to minimize the loss (cost) function, the update rule for the weight vector multiplies the gradient term by a coefficient, which is called the learning rate η.

(2) Effect:

(2-1) The smaller the learning rate η is, the smaller the change in the weight vector in each iteration: learning is slow, the trajectory in weight space is smoother, and convergence is slow;

(2-2) The larger the learning rate η is, the greater the change in the weight vector in each iteration and the faster the learning, but the weights may oscillate and fail to converge;

(3) Handling method:

(3-1) To speed up learning while keeping it stable, modify the delta rule by adding a momentum term.

(4) Selection experience:

(4-1) Manual adjustment based on experience. Try different fixed learning rates, such as 0.1, 0.01, 0.001, etc., observe how the loss changes with the number of iterations, and pick the learning rate at which the loss decreases fastest.

(4-2) Strategy-based tuning.

(4-2-1) Fixed, exponential, or polynomial schedules.

(4-2-2) Adaptive dynamic tuning: Adadelta, Adagrad, FTRL, momentum, RMSProp, SGD.

(5) Adjusting the learning rate η: the learning rate is adjusted adaptively (typically decayed) during learning.

(5-1) A non-adaptive learning rate may not be optimal.

(5-2) Momentum is a parameter of adaptive learning rate methods; it allows higher speed along shallow directions while moving at reduced speed along steeper directions.

(5-3) Reducing the learning rate is necessary because, with a high learning rate, training is likely to keep oscillating around a minimum instead of converging to it, as noted in (2-2).
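The fixed, exponential, and polynomial strategies in (4-2-1) can be written as small functions of the iteration count. The sketch below uses commonly seen forms of these schedules; the function names, parameter names, and default values are assumptions for illustration and are not taken from the references.

```python
def fixed_lr(base_lr, step):
    """Fixed strategy: the learning rate never changes."""
    return base_lr

def exponential_decay(base_lr, step, decay_rate=0.96, decay_steps=100):
    """Exponential strategy: multiply by decay_rate once every decay_steps iterations."""
    return base_lr * decay_rate ** (step / decay_steps)

def polynomial_decay(base_lr, step, total_steps=1000, end_lr=0.0001, power=1.0):
    """Polynomial strategy: move from base_lr to end_lr over total_steps iterations."""
    step = min(step, total_steps)          # hold end_lr once total_steps is reached
    remaining = 1 - step / total_steps
    return (base_lr - end_lr) * remaining ** power + end_lr

for step in (0, 100, 500, 1000):
    print(step, fixed_lr(0.1, step), exponential_decay(0.1, step), polynomial_decay(0.1, step))
```

With `power=1.0` the polynomial schedule is a straight line from `base_lr` down to `end_lr`; larger powers drop the rate faster at the start.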


[1] Simon Haykin. Neural Networks and Learning Machines [M]. China Machine Press, 2011.

[2] How to determine the size of the batch when training neural networks?

[3] Learning note: The effect of batch size on the predictive power of deep neural networks.

[4] How to select hyperparameters in machine learning algorithms: learning rate, regularization coefficient, mini-batch size. u012162613/article/details/44265967

[5] How to set the learning rate in deep learning.

[6] Adjusting the learning rate to optimize neural network training.

[7] What are some of the methods used in machine learning to prevent overfitting?

[8] Neural Networks for Machine Learning, by Geoffrey Hinton.

[9] How to determine the size of the convolution kernel, the number of convolution layers, and the number of maps per layer for a convolutional neural network.

[10] How are the convolution kernel size, number of convolutional layers, and number of maps per layer of a convolutional neural network determined?

Number of Parameters and Amount of Computation in Convolutional Neural Networks

When designing a convolutional neural network, the size of the network needs to be considered, i.e., the number of parameters and the amount of computation. The number of parameters refers to the number of parameters in the convolution kernels, and the amount of computation refers to the number of numerical operations the network performs. Due to parameter sharing, the number of parameters depends only on the number of feature maps and not on the size of the feature maps, whereas the amount of computation depends on both.

The number of parameters for a common convolutional operation is:

Similarly, the computational amount is

Depthwise separable convolution first performs a one-to-one (per-channel) convolution, and thus the computational amount is

This is followed by a convolution that fuses the features across channels
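The formulas themselves are missing from this copy, so the sketch below falls back on the usual textbook counts rather than the author's exact expressions. It assumes a square k × k kernel, no bias, an output feature map of size h_out × w_out, a 1 × 1 convolution for the feature-fusion step, and it counts one multiply-add as one operation; all function and parameter names are made up for the example.

```python
def standard_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and multiply-adds of an ordinary convolution (no bias)."""
    params = k * k * c_in * c_out          # independent of the feature map size
    mult_adds = params * h_out * w_out     # every output position reuses the same weights
    return params, mult_adds

def depthwise_separable_cost(c_in, c_out, k, h_out, w_out):
    """Per-channel k x k convolution followed by a 1 x 1 convolution that fuses channels."""
    depthwise_params = k * k * c_in        # one k x k kernel per input channel
    pointwise_params = c_in * c_out        # 1 x 1 convolution across channels
    params = depthwise_params + pointwise_params
    mult_adds = params * h_out * w_out
    return params, mult_adds

print(standard_conv_cost(32, 64, 3, 56, 56))        # (18432, 57802752)
print(depthwise_separable_cost(32, 64, 3, 56, 56))  # (2336, 7325696)
```

In this example the separable version needs roughly one eighth of the parameters and multiply-adds of the ordinary convolution.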

CNN principle analysis

Visualizing a CNN shows that it recognizes an object by going from the local to the whole: the CNN recognizes local features together with the positions of those features, and these can be pieced together into an overall recognition.

A CNN is composed of convolutional layers, sampling (pooling) layers and fully connected layers; the convolution step works as follows.

Consider an input image with only one channel, i.e., a two-dimensional matrix. Taking a 5 * 5 image as an example, applying a 3 * 3 filter gives a 3 * 3 result. The arithmetic works like this: the top-left 3 * 3 window of the image (the blue box in the figure) is combined with the filter to produce the blue 4 in the result matrix; the operation multiplies the values at the same positions and then adds up the nine products. The convolution kernel is then shifted one unit to the right, and the nine numbers in the next window (the red box) give the red 3 in the result. Continuing to shift right and down in this way yields the final result.
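The sliding-window arithmetic just described can be sketched in a few lines. The figure with the actual 5 × 5 image is not available here, so the pixel and filter values below are assumptions, chosen so that the first two outputs are 4 and 3 as in the description; `conv2d_valid` is a hypothetical helper name.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), summing elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Assumed example values (the original figure is missing).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
result = conv2d_valid(image, kernel)
for row in result:
    print(row)
```

The top-left window gives 4, the window one step to the right gives 3, and the full result is a 3 × 3 matrix.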

In practice, the input picture is generally in RGB format, that is, three channels, so a filter needs three convolution kernels at a time, one per channel.

The formula for convolution: after a picture is input and convolved, the size of the output depends on the size of the original picture and the size of the convolution kernel. A few concepts are introduced first:

The formula is given below:
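The concepts and the formula did not survive in this copy. A commonly used form of the output-size formula, assuming the missing concepts are padding p and stride s, is (n + 2p - f) / s + 1 rounded down, for an input of size n and a kernel of size f. The sketch below encodes that assumption; the function name and parameter names are hypothetical.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output size along one dimension for input size n and kernel size f."""
    return (n + 2 * padding - f) // stride + 1

print(conv_output_size(5, 3))             # 5 x 5 image, 3 x 3 filter -> 3
print(conv_output_size(5, 3, padding=1))  # one ring of padding keeps the size -> 5
print(conv_output_size(7, 3, stride=2))   # stride 2 -> 3
```

The first call matches the 5 * 5 image and 3 * 3 filter example, which produces a 3 * 3 result.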


In fact, just such a simple operation makes processing much more efficient, with the following advantages:


Introduction to Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNN)

Convolutional Neural Network Model Parameter Counts and Operations Calculations

This article derives the formulas and methods for calculating the parameter counts and floating-point operations of convolutional neural network models. To calculate these figures automatically using an API, please see another blog post: Automatically Calculating Model Parameter Counts, FLOPs, Multiply-Add and Memory Required

The number of parameters of a convolutional layer is:

$$\text{params} = C_{out} \times (k_w \times k_h \times C_{in} + 1)$$

where $C_{out}$ denotes the number of output channels, $C_{in}$ denotes the number of input channels, $k_w$ denotes the convolution kernel width, and $k_h$ denotes the convolution kernel height.

Inside the brackets, $k_w \times k_h \times C_{in}$ is the number of weights of one convolution kernel and +1 is the bias, so the brackets give the number of parameters of one convolution kernel; $C_{out} \times$ indicates that there are $C_{out}$ convolution kernels in the layer.

If the convolution kernel is square, i.e., $k_w = k_h = k$, then the above equation becomes:

$$\text{params} = C_{out} \times (k^2 \times C_{in} + 1)$$

It should be noted that bias is not required when using Batch Normalization, in which case the +1 term in the equation is removed.

FLOPs stands for floating-point operations. For a convolutional layer:

$$\text{FLOPs} = [(C_{in} \times k_w \times k_h) + (C_{in} \times k_w \times k_h - 1) + 1] \times C_{out} \times W \times H$$

The value in the square brackets is the number of operations (multiplications and additions) required for one convolution operation to compute one point of the feature map: $C_{in} \times k_w \times k_h$ is the number of multiplications in one convolution operation, $C_{in} \times k_w \times k_h - 1$ is the number of additions in one convolution operation, and +1 denotes the bias. $W$ and $H$ denote the length and width of the feature map, respectively, and $C_{out} \times W \times H$ denotes the number of all elements of the feature map.

If the convolution kernel is square, i.e., $k_w = k_h = k$, this becomes:

$$\text{FLOPs} = 2 \times C_{in} \times k^2 \times C_{out} \times W \times H$$

The above is the sum of the multiply and add operations, treating each multiply or add as one floating-point operation.

In computer vision papers, a multiply-add combination is often regarded as a single floating-point operation, written 'Multi-Add', which gives exactly half the count of the formula above:

$$\text{Multi-Add} = C_{in} \times k^2 \times C_{out} \times W \times H$$
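The counting rules described above for a convolutional layer can be checked with a small script. This is a sketch under the assumption that the layer follows those rules exactly (one bias per kernel, W × H output positions); the function and parameter names are chosen for the example only.

```python
def conv_params(c_in, c_out, k_w, k_h, bias=True):
    """Weights of one kernel plus an optional bias, times the number of kernels."""
    return (k_w * k_h * c_in + (1 if bias else 0)) * c_out

def conv_flops(c_in, c_out, k_w, k_h, out_w, out_h, bias=True):
    """Multiplications and additions counted separately."""
    mults = c_in * k_w * k_h               # multiplications for one output point
    adds = c_in * k_w * k_h - 1            # additions needed to sum those products
    per_point = mults + adds + (1 if bias else 0)
    return per_point * c_out * out_w * out_h

def conv_multi_adds(c_in, c_out, k_w, k_h, out_w, out_h):
    """One multiply-add pair counted as a single operation."""
    return c_in * k_w * k_h * c_out * out_w * out_h

# Assumed example: 3 input channels, 64 output channels, 3 x 3 kernel, 224 x 224 output.
print(conv_params(3, 64, 3, 3))                # 1792
print(conv_flops(3, 64, 3, 3, 224, 224))       # 173408256
print(conv_multi_adds(3, 64, 3, 3, 224, 224))  # 86704128
```

The multiply-add count is exactly half of the FLOPs count when the bias is included, matching the halving described in the text.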

For a fully connected layer, the number of parameters is:

$$\text{params} = (I + 1) \times O = I \times O + O$$

where $I$ denotes the number of input neurons and $O$ denotes the number of output neurons. Notably, the vector initially flattened from the feature map is considered to be the first fully connected layer, i.e., $I$ here.

It is possible to understand the above equation in this way: each output neuron connects to all input neurons, so it has $I$ weights, and a bias is added to each output neuron.

It is also possible to understand the above equation in this way: each layer of neurons (layer O) has $I \times O$ weights, and the number of biases is $O$.

The number of floating-point operations is:

$$\text{FLOPs} = [I + (I - 1) + 1] \times O = (2 \times I) \times O$$

The value in the square brackets denotes the number of operations required to compute one neuron: the first $I$ is the number of multiplications, $I - 1$ is the number of additions, and +1 denotes the bias; $\times O$ denotes computing the values of $O$ neurons.
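The same bookkeeping for a fully connected layer fits in two short functions. As before, this is only a sketch of the counting rule described above; `n_in` and `n_out` stand for the numbers of input and output neurons and are names chosen for the example.

```python
def fc_params(n_in, n_out, bias=True):
    """Each output neuron has n_in weights plus an optional bias."""
    return (n_in + (1 if bias else 0)) * n_out

def fc_flops(n_in, n_out, bias=True):
    """n_in multiplications, n_in - 1 additions and one bias addition per output neuron."""
    return (n_in + (n_in - 1) + (1 if bias else 0)) * n_out

# Assumed example: 4096 input neurons, 1000 output neurons.
print(fc_params(4096, 1000))  # 4097000
print(fc_flops(4096, 1000))   # 8192000
```

With the bias included, the operation count reduces to 2 × n_in × n_out.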

Grouped convolution and depthwise separable convolution: to be continued ……

Is epoch the number of iterations

What does epochs mean in matlab

x=rand(2, 2001); generates a matrix of two rows and 2001 columns of uniformly distributed random numbers (4002 in total). The default range is from 0 to 1; it can be transformed to random numbers from -10 to 10 by means of 20 * x - 10 (the original poster's code also works).

The function newff builds a trainable feedforward network. It takes 4 input parameters. The first parameter is an R×2 matrix that defines the minimum and maximum values of the R input elements. The second parameter is an array that sets the number of neurons per layer.

sim runs the simulation. net is the network that was trained earlier, so this is equivalent to substituting R into net as the independent variable. I am still figuring out neural networks myself, so please correct me if I am wrong: a time series should be one row of data corresponding to another row of data, equivalent to x corresponding to y.

I'm not sure exactly what you are asking, but I'll try to answer: P1 stands for the first input and P2 stands for the second input. During training, the two inputs are fed to the network, and the network is trained to output the target GOAL.

Normalize the input matrix and the target matrix first; failing to do so may keep the network from converging.

All the statements are shown; this is a neural network processing program. It first creates a feedforward neural network (a BP neural network) and sets the weights and thresholds, then trains the BP neural network with the TRAINGDM algorithm, and finally runs the BP neural network simulation. There is nothing more to it; the comments already explain it in detail.

Quickly understand the difference between epoch, iteration and batch

1. For example, with 1000 training samples and a batch size of 10, the number of iterations is 100: all 1000 samples have to be trained and 10 samples are input each time, so 100 iterations are needed. That is, one epoch is complete when all the samples have gone through one round of back-propagation.

2. Epoch and iteration: 1 iteration is equal to training once using batch-size samples; 1 epoch is equal to training once using all the samples in the training set. In the example above, doing 100 iterations is equal to doing 1 epoch of training.

3. However, the number of samples in an epoch (that is, all the training samples) may be too large for the computer, so it needs to be divided into multiple smaller pieces, that is, into a number of batches for training. Batch: one of the pieces the entire training set is divided into. Batch size: the size of each batch.
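The relationship between samples, batch size, iterations and epochs in the three points above comes down to a division. A minimal sketch, assuming that a final partial batch still counts as one iteration (the function names are made up for the example):

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every training sample once."""
    return math.ceil(num_samples / batch_size)

def total_iterations(num_samples, batch_size, epochs):
    """Number of weight updates over the whole training run."""
    return iterations_per_epoch(num_samples, batch_size) * epochs

print(iterations_per_epoch(1000, 10))  # 100
print(iterations_per_epoch(1000, 32))  # 32, the last batch holds only 8 samples
print(total_iterations(1000, 10, 5))   # 500
```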

Neural Network Hyperparameter Selection

1. Try to choose an activation function whose output is zero-centered, in order to speed up the convergence of the model.

2. For example, in a BP neural network, the main purpose of model selection is to choose the number of layers, the activation function of the neurons, and the number of neurons per layer (the so-called hyperparameters). The final connection weights of each layer are obtained after model selection (i.e., K-fold cross validation) by retraining on all the training data.

3. When training a neural network, choose the batch size first and then adjust the other hyperparameters. Practically speaking, there are two principles: the batch size should be neither too small nor too large, and anything in between works. This is because the appropriate range of batch sizes has no significant relationship with the size of the training data, the number of layers of the network, or the number of units.

4. The choice of architecture and hyperparameters follows. In the first round, the localizer model is applied to the maximum center square crop of the image. The crop is resized to the network input size, which is 220 × 220. A single pass through this network gives us hundreds of candidate boxes.

5. However, the ultra-high accuracy of DNN comes at the cost of ultra-high computational complexity. Computing engines in the usual sense, especially GPUs, are the foundation of DNNs.

6. For the BP neural network regression overfitting problem, it is recommended to try to solve it using L1 regularization and dropout methods. If feature selection is required, L1 regularization can be used. If you need to improve the generalization ability of the network, you can use the dropout method.
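The two techniques named in point 6 can be sketched in plain Python. This is an illustration only: the L1 term is the penalty added to the loss, and the dropout shown is the common "inverted" variant that rescales the surviving activations, which is an assumption rather than something stated in the text; all names are made up for the example.

```python
import random

def l1_penalty(weights, lam):
    """L1 regularization term added to the loss: lam times the sum of |w|."""
    return lam * sum(abs(w) for w in weights)

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each activation with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)           # dropout is disabled at inference time
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [1.0, -2.0, 0.5]
print(l1_penalty(weights, 0.1))

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, rng))
```

Because the L1 penalty grows with the absolute value of every weight, it pushes unimportant weights toward zero, which is why the text links it to feature selection.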

What is the significance of multiple epochs for deep learning

Training is generally done using stochastic gradient descent (SGD), where one batch is selected for the update in each iteration. One epoch means that the number of iterations × the batch size equals the number of training samples.

Iteration: 1 iteration is equal to training once using batch-size samples. Epoch: 1 epoch is equal to training once using all the samples in the training set; for example, if the training set is split into 100 batches, doing 100 iterations is equal to doing 1 epoch of training. Both epoch and iteration are terms used in deep learning.

Increasing the batch size improves memory utilization through parallelization. The number of iterations in a single epoch is reduced, which increases processing speed (for a single epoch, the number of iterations = all training samples / batch size). Increasing Batch_Size appropriately makes the gradient descent direction more accurate and reduces the amplitude of the training oscillation.

Is the bp neural network algorithm iterated once for all samples

Yes, all samples are processed once. The samples are taken in order, substituted into the BP algorithm, and the weights are adjusted. Some algorithms do this in a randomized fashion, where the samples come in a different order each time, but all samples are still involved.

The only possible difference is that in the standard BP algorithm, the error is passed back and the weights are adjusted for each input sample; this way of taking the samples in turn is called "single-sample training". Because single-sample training is "local", focusing only on the error generated by the current sample, attending to one sample inevitably neglects the others, so the number of training passes increases and convergence is slow. Therefore, there is another method: after all the samples have been input, the total error of the network is calculated, and the weights are then adjusted according to the total error. This batch processing of the accumulated error is called "batch training" or "cycle training". When the number of samples is large, batch training converges faster than single-sample training.
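The difference between the two schemes lies only in when the weights are updated, which the sketch below makes concrete. It assumes a one-weight model y = w*x with a squared-error loss and a tiny noise-free data set; these choices, the learning rate, and the function names are all assumptions made for the example, and the sketch makes no claim about which scheme converges faster.

```python
def train_single_sample(xs, ys, lr=0.1, epochs=100):
    """Single-sample training: adjust the weight after every individual sample."""
    w, updates = 0.0, 0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w -= lr * 2 * (w * x - y) * x          # gradient of (w*x - y)^2 for one sample
            updates += 1
    return w, updates

def train_batch(xs, ys, lr=0.1, epochs=100):
    """Batch training: accumulate the error over all samples, then adjust once."""
    w, updates = 0.0, 0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        updates += 1
    return w, updates

xs, ys = [1.0, 2.0, 3.0], [3.0, 6.0, 9.0]          # generated by y = 3x
w_single, updates_single = train_single_sample(xs, ys)
w_batch, updates_batch = train_batch(xs, ys)
print(w_single, updates_single)
print(w_batch, updates_batch)
```

Both runs pass over the three samples 100 times, but single-sample training performs 300 weight updates while batch training performs 100.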