3D convolutional neural network

Is the AlexNet network a three-dimensional convolutional neural network?

AlexNet is a convolutional neural network (CNN) model for image classification. Its convolutional layers treat the input as a three-dimensional tensor (height, width, and color channel), and each kernel spans all input channels, so the kernels themselves are three-dimensional. However, the kernel slides only along the two spatial dimensions, so these layers are conventionally called 2D convolutions. A three-dimensional convolutional neural network (3D CNN), by contrast, is a deep learning model mainly used to process data with a third axis to convolve over, such as video or volumetric data. Unlike a 2D CNN, a 3D CNN can process temporal data, i.e., it also slides the kernel along the time dimension. Therefore, AlexNet operates on three-dimensional tensors, but in the usual terminology it is a 2D convolutional neural network rather than a 3D CNN.
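The difference between a kernel that merely spans the channel axis and a kernel that slides along a third axis can be sketched in a few lines. This is a minimal NumPy illustration, not AlexNet itself; the function names, the toy shapes, and the choice of stride 1 with no padding are assumptions made for the example.

```python
import numpy as np

def conv2d_multichannel(x, k):
    """x: (H, W, C) image, k: (kh, kw, C) kernel.
    The kernel covers every channel but slides over H and W only."""
    H, W, C = x.shape
    kh, kw, kc = k.shape
    assert kc == C, "kernel depth must equal the number of channels"
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw, :] * k)
    return out

def conv3d_single_channel(x, k):
    """x: (T, H, W) clip, k: (kt, kh, kw) kernel.
    The kernel slides over time as well as over H and W."""
    T, H, W = x.shape
    kt, kh, kw = k.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(x[t:t + kt, i:i + kh, j:j + kw] * k)
    return out

image = np.ones((8, 8, 3))   # hypothetical 8x8 RGB image
video = np.ones((5, 8, 8))   # hypothetical 5-frame grayscale clip
out2d = conv2d_multichannel(image, np.ones((3, 3, 3)))
out3d = conv3d_single_channel(video, np.ones((3, 3, 3)))
print(out2d.shape)  # (6, 6): the channel axis is consumed, not slid over
print(out3d.shape)  # (3, 6, 6): the output keeps a time axis
```

Both kernels here have shape 3×3×3; what differs is whether the third axis is summed over in one step or traversed position by position.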

A brief description of the structure of convolutional neural networks

The structure of a convolutional neural network is as follows:

1. Input layer.

The input layer is the input to the entire neural network. In a convolutional neural network that processes images, it generally represents the pixel matrix of a picture.

2. Convolutional layer.

As the name suggests, the convolutional layer is the most important part of a convolutional neural network. Unlike a traditional fully connected layer, each node in the convolutional layer takes as input only a small patch of the previous layer, commonly of size 3×3 or 5×5.

3. Pooling layer.

The pooling layer does not change the depth of the three-dimensional matrix, but it reduces the matrix's height and width. The pooling operation can be thought of as converting a higher-resolution image into a lower-resolution image.

4. Fully connected layer.

After several rounds of convolutional and pooling layers, the information in the image can be assumed to have been abstracted into features with higher information content. The convolutional neural network therefore generally ends with one or two fully connected layers that produce the final classification result.

5. Softmax layer.

The Softmax layer is mainly used for classification problems. Its output is a probability distribution over the categories to which the current sample may belong.
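As an illustration of how the Softmax layer turns raw scores into a probability distribution, here is a minimal NumPy sketch. The input scores are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something stated in the text.

```python
import numpy as np

def softmax(scores):
    """Map a vector of raw class scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow in exp; result is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical outputs of the last fully connected layer
probs = softmax(scores)
print(probs)        # one probability per class, largest for the largest score
print(probs.sum())  # sums to 1 (up to floating-point rounding)
```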

Introduction to Convolutional Neural Networks:

Convolutional neural networks are a class of feed-forward neural networks that contain convolutional computation and have a deep structure; they are one of the representative algorithms of deep learning. Convolutional neural networks have the ability to learn representations and can classify input information in a translation-invariant way according to their hierarchical structure, so they are also called “translation-invariant artificial neural networks”.

Research on convolutional neural networks began in the 1980s and 1990s; time-delay networks and LeNet-5 were the first convolutional neural networks to appear. In the twenty-first century, with the introduction of deep learning theory and the improvement of numerical computing hardware, convolutional neural networks have developed rapidly and have been applied to computer vision, natural language processing, and other fields.

Convolutional neural networks are modeled after biological visual perception mechanisms and can perform both supervised and unsupervised learning. The sharing of convolutional kernel parameters within the hidden layers and the sparsity of inter-layer connections allow convolutional neural networks to learn from grid-like data, such as pixels and audio, with less computation, stable results, and no additional feature engineering requirements on the data.
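The saving from parameter sharing and sparse connections can be made concrete with a little arithmetic. The sketch below compares parameter counts for a fully connected layer and a convolutional layer; the layer sizes (a 28×28 RGB input, 100 output units or channels, a 5×5 kernel) are illustrative assumptions, not figures from the text.

```python
def fully_connected_params(in_h, in_w, in_c, out_units):
    """Every output unit has one weight per input value, plus one bias."""
    return in_h * in_w * in_c * out_units + out_units

def conv_params(kernel_h, kernel_w, in_c, out_channels):
    """Each output channel reuses one small kernel at every position, plus one bias."""
    return kernel_h * kernel_w * in_c * out_channels + out_channels

fc_count = fully_connected_params(28, 28, 3, 100)
conv_count = conv_params(5, 5, 3, 100)
print(fc_count)    # 235300
print(conv_count)  # 7600
```

The convolutional count does not depend on the image size at all, which is why the same kernel can be applied to larger inputs without adding weights.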

CNN (Convolutional Neural Network) Algorithm

Basics Explained:

Convolution: a mathematical operator that generates a third function from two functions f and g. One of the functions is flipped and shifted, and the result at each shift is the integral (or, for discrete data, the sum) of the product of f and g over the region where they overlap.

Feedforward neural network: the neurons are arranged in layers, and each neuron is connected only to the neurons in the previous layer; it receives the output of the previous layer and passes its own output to the next layer. There is no feedback between the layers.

Convolutional neural network: a class of feed-forward neural networks that contain convolutional computation and have a deep structure.

Convolutional kernel: in image processing, given an input image, each pixel of the output image is a weighted sum of the pixels in a small region of the input image; the function that defines the weights is called the convolutional kernel.

Downsampling: taking one sample every several samples from a sequence, so that a new, shorter sequence is obtained from the original one.
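The definitions of convolution and downsampling above can be illustrated on short sequences. This is a minimal sketch: `convolve_1d` is a hypothetical helper written for the example, checked against NumPy's `np.convolve`, and the input values are arbitrary.

```python
import numpy as np

def convolve_1d(f, g):
    """Discrete convolution: out[n] = sum over k of f[k] * g[n - k].
    The index n - k is what "flipping and shifting" g amounts to."""
    n_out = len(f) + len(g) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for k in range(len(f)):
            if 0 <= n - k < len(g):
                out[n] += f[k] * g[n - k]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
result = convolve_1d(f, g)
print(result)              # [0.  1.  2.5 4.  1.5]
print(np.convolve(f, g))   # same values from NumPy's built-in

# Downsampling: keep one sample out of every three
sequence = np.arange(10)
downsampled = sequence[::3]
print(downsampled)         # [0 3 6 9]
```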


Input layer: used for data input.

Convolutional layer: uses convolutional kernels for feature extraction and feature mapping.

Excitation layer: applies a non-linear mapping; because convolution is a linear mapping, the excitation (activation) layer compensates for this limitation.

Pooling layer: performs downsampling, a sparse processing of the feature maps that reduces the amount of computation.

Fully connected layer: performs a final re-fitting at the tail of the CNN.

Input layer:

In the input layer of a CNN, the format of the (image) input data differs from the input format of a fully connected neural network (a one-dimensional vector). The input layer of a CNN preserves the structure of the image itself.

For a black-and-white 28×28 picture, the CNN’s input is a 28×28 two-dimensional array of neurons.

For a 28×28 picture in RGB format, the CNN’s input is a 3×28×28 three-dimensional array of neurons (one 28×28 matrix for each of the RGB color channels).
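The shapes described above can be written down directly with NumPy arrays. This is only a sketch of the data layout; the zero-filled arrays stand in for real pixel values, and the channel-first ordering follows the 3×28×28 convention used in the text.

```python
import numpy as np

gray = np.zeros((28, 28))      # black-and-white picture: one 28x28 matrix
rgb = np.zeros((3, 28, 28))    # RGB picture: one 28x28 matrix per color channel

# A fully connected network would instead receive a flat one-dimensional vector,
# which discards the row/column/channel structure.
flat = rgb.reshape(-1)

print(gray.shape)  # (28, 28)
print(rgb.shape)   # (3, 28, 28)
print(flat.shape)  # (2352,)
```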

Convolutional layer:

In the original figure, the left side is the input, the middle shows two different filters, Filter w0 and Filter w1, and the far right shows the two corresponding outputs.


w_{m,n}: the value in the mth row and nth column of the filter

x_{i,j}: the element in the ith row and jth column of the image

w_b: the bias term of the filter

a_{i,j}: the element in the ith row and jth column of the feature map

f: the ReLU activation function
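The formula these symbols belong to is missing from the text. A standard form consistent with the definitions above, assuming stride 1 and no padding, is a_{i,j} = f( Σ_m Σ_n w_{m,n} · x_{i+m,j+n} + w_b ). The sketch below implements that assumed form in NumPy; the 5×5 image and 3×3 filter values are made up for the example.

```python
import numpy as np

def relu(z):
    """f in the formula: ReLU, max(0, z)."""
    return np.maximum(0.0, z)

def feature_map(x, w, w_b):
    """a[i, j] = f(sum_m sum_n w[m, n] * x[i + m, j + n] + w_b).
    Assumes stride 1 and no padding, so the output shrinks by (filter size - 1)."""
    rows = x.shape[0] - w.shape[0] + 1
    cols = x.shape[1] - w.shape[1] + 1
    a = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = x[i:i + w.shape[0], j:j + w.shape[1]]
            a[i, j] = relu(np.sum(w * patch) + w_b)
    return a

x = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]], dtype=float)   # hypothetical 5x5 image
w = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]], dtype=float)          # hypothetical 3x3 filter
a = feature_map(x, w, w_b=0.0)
print(a)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```

Like most deep learning libraries, this computes a cross-correlation (the filter is not flipped); since the filter weights are learned, the distinction does not matter in practice.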

Excitation Layer:

The excitation (activation) function used is typically the ReLU function, f(x) = max(0, x).


Convolutional and excitation layers are often combined together as the “convolution layer”.

Pooling layer:

After the input passes through the convolutional layer, if the receptive field and the stride are both small, the resulting feature map is still large. A pooling layer can then be used to reduce the dimensions of each feature map; the depth of the output is unchanged and remains equal to the number of feature maps.

The pooling layer also has a “pooling window” (filter) that scans the feature map matrix and computes a single value from the entries inside the window. There are generally two kinds of calculation:

Max pooling: take the maximum value in the pooling window

Average pooling: take the average value in the pooling window
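The two pooling calculations can be sketched with a small NumPy helper. The function name `pool2d`, the 2×2 window with stride 2, and the feature map values are assumptions chosen for illustration.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Scan a 2D feature map with a size x size window and reduce each window
    to its maximum ("max") or its mean ("avg")."""
    rows = (fmap.shape[0] - size) // stride + 1
    cols = (fmap.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]], dtype=float)   # hypothetical 4x4 feature map

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
print(max_pooled)  # [[6. 8.] [3. 4.]]
print(avg_pooled)  # [[3.75 5.25] [2.   2.  ]]
```

Applied to each feature map separately, this halves the height and width while leaving the number of feature maps (the depth) unchanged.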

Training process:


1. Forward pass: calculate the output value a_j of each neuron (j denotes the jth neuron of the network, the same below);

2. Backward pass: calculate the error term σ_j of each neuron; σ_j is also called sensitivity in some literature. It is actually the partial derivative of the network’s loss function E_d with respect to the neuron’s weighted input;

3. Calculate the gradient of each neuron’s connection weights w_{i,j} (w_{i,j} denotes the weight of the connection from neuron i to neuron j);

4. Finally, update each weight according to the gradient descent rule.
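The steps above can be sketched for the simplest possible case: a single sigmoid neuron trained on one sample with squared-error loss E_d = ½(t − a)². The sigmoid activation, the loss, the learning rate, and the data values are all assumptions made for this illustration; a real CNN applies the same four steps to every layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(w, b, x, target, lr=0.5):
    # Step 1: forward pass, output value a of the neuron
    net = np.dot(w, x) + b            # weighted input
    a = sigmoid(net)
    # Step 2: error term (sigma_j in the text) = dE_d / d(net)
    error_term = (a - target) * a * (1.0 - a)
    # Step 3: gradient of the loss with respect to each weight and the bias
    grad_w = error_term * x
    grad_b = error_term
    # Step 4: gradient descent update
    w_new = w - lr * grad_w
    b_new = b - lr * grad_b
    loss = 0.5 * (target - a) ** 2    # loss before the update
    return w_new, b_new, loss

w = np.array([0.1, -0.2])   # hypothetical initial weights
b = 0.0
x = np.array([1.0, 2.0])    # hypothetical input sample
target = 1.0

losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x, target)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss shrinks as the weights are updated
```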

Reference: https://blog.csdn.net/love__live1/article/details/79481052

Principle of Convolutional Neural Network

A Convolutional Neural Network (CNN) is a kind of feed-forward neural network inspired by the natural visual cognitive mechanism of living beings. CNNs have become a research hotspot in many scientific fields, especially pattern classification: because the network avoids complex pre-processing of the image and can take the original image directly as input, it has been widely adopted. CNNs can be applied to image classification, object recognition, object detection, semantic segmentation, and so on. (Figure: the basic structure of a convolutional neural network for image classification.)

1. Definition

Convolutional Neural Networks (CNNs) are a class of feedforward neural networks that contain convolutional computation and have a deep structure. They are one of the representative algorithms of deep learning. Convolutional neural networks have the ability of representation learning and can perform shift-invariant classification of input information according to their hierarchical structure, so they are also called “Shift-Invariant Artificial Neural Networks (SIANN)”.

2. Characteristics

Compared with the traditional neural networks introduced previously, whose layers are built only from fully connected (linear) connections, CNNs also include convolution operations, pooling operations, nonlinear activation function mapping, and so on.

3. Applications and Typical Networks

Classical CNN Networks:




Common Applications:

Deep learning has been used with great success in computer image recognition. It makes it possible to recognize images with high accuracy, and this relies heavily on a branch of neural networks called convolutional networks.

3D-R2N2 Principle

3D-R2N2 (3D Recurrent Reconstruction Neural Network) is based on deep learning and convolutional neural networks. Specifically, it takes one or more 2D images of an object, encodes each image with a 2D convolutional neural network, fuses the encoded features across views with a 3D convolutional recurrent unit, and decodes the result with a 3D deconvolutional network into a 3D voxel occupancy grid. In this way, 3D-R2N2 reconstructs the 3D shape of an object from images; its purpose is 3D reconstruction rather than classification of 3D objects.