Correlation coefficient matrix for principal component analysis

What is the difference between the correlation coefficient matrix and the covariance matrix in principal component analysis?

In statistics and probability theory, the autocorrelation and autocovariance matrices of a random vector x, and the cross-correlation and cross-covariance matrices of two random vectors x and y, are obtained by computing the correlation or covariance between the ith and jth components of the vectors (each component being a random variable). This is a natural generalization from scalar random variables to higher-dimensional random vectors.

Correlation matrix: also called the correlation coefficient matrix. It is made up of the correlation coefficients between the columns of a data matrix: the element in the ith row and jth column of the correlation matrix is the correlation coefficient between the ith and jth columns of the original matrix.

Covariance matrix: in statistics and probability theory, each element of the covariance matrix is the covariance between two components of a random vector. It is the natural generalization of the variance of a scalar random variable to higher-dimensional random vectors.

Both the correlation coefficient matrix and the covariance matrix describe how strongly the variables (the columns of a data matrix) are linearly related to one another.
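As a minimal sketch of the two definitions above, the following Python computes both matrices for the columns of a small data matrix. The data values and function names are illustrative assumptions, and the sample divisor n - 1 is assumed for the covariance.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix of the columns of `data` (rows are observations)."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    m = len(cols)
    return [[sum((cols[i][k] - means[i]) * (cols[j][k] - means[j]) for k in range(n)) / (n - 1)
             for j in range(m)]
            for i in range(m)]

def correlation_matrix(data):
    """Correlation matrix: each covariance divided by the two standard deviations."""
    cov = covariance_matrix(data)
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical data: 4 observations of 2 variables.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
C = covariance_matrix(data)
R = correlation_matrix(data)
print("covariance matrix:", C)
print("correlation matrix:", R)
```

The diagonal of the covariance matrix holds the variances of the columns, while the diagonal of the correlation matrix is always 1.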

How to do principal component analysis

You can carry out principal component analysis in MATLAB. The specific steps are as follows:

① Standardize the data: subtract each variable's mean and divide by its standard deviation.

② Calculate the sample covariance matrix of the standardized data; this equals the correlation coefficient matrix R of the original data.

③ Calculate the eigenvalues and eigenvectors of R.

④ Calculate the contribution rate of each principal component (its eigenvalue divided by the sum of all eigenvalues) and the cumulative contribution rate (the running sum of the contribution rates, starting from the largest eigenvalue).

⑤ Write out the principal components and keep those needed for the cumulative contribution rate to exceed 80%.

⑥ Finally, use the results in the subsequent analysis.
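The six steps can be sketched end to end in Python using only the standard library. This is an illustrative sketch, not the answer's MATLAB code: the data set and all function names are assumptions, and the eigenpairs are found by power iteration with deflation, a technique chosen here only because it is short to write down.

```python
import math

def standardize(data):
    """Step 1: centre each column and divide by its sample standard deviation."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - mu) ** 2 for x in c) / (n - 1))
           for c, mu in zip(cols, means)]
    return [[(row[j] - means[j]) / sds[j] for j in range(len(cols))] for row in data]

def correlation_matrix(z):
    """Step 2: covariance matrix of standardized data, i.e. the correlation matrix R."""
    n, m = len(z), len(z[0])
    return [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1) for j in range(m)]
            for i in range(m)]

def eigen_pairs(r, iterations=1000):
    """Step 3: (eigenvalue, eigenvector) pairs, largest first.
    Power iteration with deflation is an assumed choice for this sketch."""
    m = len(r)
    a = [row[:] for row in r]
    pairs = []
    for _component in range(m):
        v = [1.0 / (i + 1) for i in range(m)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _step in range(iterations):
            w = [sum(a[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(m) for j in range(m))
        pairs.append((lam, v))
        # Deflation: remove the component just found.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
    return pairs

# Hypothetical data: 6 samples, 3 strongly correlated variables.
data = [[2.5, 2.4, 1.2], [0.5, 0.7, 0.3], [2.2, 2.9, 1.0],
        [1.9, 2.2, 0.8], [3.1, 3.0, 1.6], [2.3, 2.7, 1.1]]

z = standardize(data)
R = correlation_matrix(z)
pairs = eigen_pairs(R)
eigenvalues = [lam for lam, _vec in pairs]

# Steps 4 and 5: contribution rates, then keep components until the
# cumulative contribution rate exceeds 80%.
total = sum(eigenvalues)
cumulative, k = 0.0, 0
for lam in eigenvalues:
    cumulative += lam / total
    k += 1
    if cumulative > 0.80:
        break

# Step 6: scores of each sample on the retained components.
components = [vec for _lam, vec in pairs[:k]]
scores = [[sum(x * c for x, c in zip(row, vec)) for vec in components] for row in z]
print("eigenvalues:", eigenvalues)
print("components kept:", k)
```

Because the three columns move together, a single component is enough to pass the 80% threshold for this data.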

All of these steps can be carried out in MATLAB.

In short, principal component analysis can solve the problem of multicollinearity and is a fundamental method used in regression and clustering.

What is the difference between the correlation coefficient matrix and the covariance matrix in principal component analysis?

Correlation coefficient matrix: a matrix that represents the relationship between the variables after the units of measurement have been eliminated.

Covariance matrix: a matrix that represents the relationship between the variables without eliminating the units of measurement.

Compare the formula that relates the two:

r = cov(x, y) / (d(x) · d(y)), where d(x) and d(y) are the standard deviations of x and y.
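A small sketch of what eliminating the units of measurement means in practice: changing the unit of one variable changes the covariance but leaves the correlation coefficient unchanged. The height and weight values are made up for illustration, and the sample divisor n - 1 is assumed.

```python
import math

def cov(x, y):
    """Sample covariance of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def corr(x, y):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return cov(x, y) / (math.sqrt(cov(x, x)) * math.sqrt(cov(y, y)))

# Hypothetical measurements.
height_m = [1.60, 1.70, 1.75, 1.82]
weight_kg = [55.0, 66.0, 70.0, 80.0]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

cov_m, cov_cm = cov(height_m, weight_kg), cov(height_cm, weight_kg)
r_m, r_cm = corr(height_m, weight_kg), corr(height_cm, weight_kg)
print("covariance (m):", cov_m, " covariance (cm):", cov_cm)
print("correlation (m):", r_m, " correlation (cm):", r_cm)
```

The covariance grows by the same factor of 100 as the unit change, which is why PCA on the covariance matrix is dominated by variables with large numeric scales, while PCA on the correlation matrix is not.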

Check out my blog http://blog.csdn.net/yugao1986/article/details/6878578

Principal Component Analysis (PCA)

When analyzing the benefits of reclaiming disaster-damaged land, many factors come into play, and these factors are correlated with one another. Combining the correlated factors mathematically into a few composite factors yields new factors that retain the information of the original ones but are uncorrelated with each other. Simplifying the problem and capturing its essence is the key to the analysis, and principal component analysis serves exactly this purpose.

(A) Basic principles of principal component analysis

Principal component analysis (PCA) is a statistical method that converts many original variables into a small number of composite indicators. Mathematically, it is a dimensionality reduction technique: by studying the internal structure of the correlation matrix of the original indicators, it recombines them into a new set of mutually uncorrelated indicators, from which a few composite indicators are selected to reflect the information in the original ones. Suppose there are n evaluation units, each described by m factors; together they form an n×m data matrix:

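$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$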

If the m factors are denoted x1, x2, …, xm and their composite factors z1, z2, …, zp (p ≤ m), then:

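$$\begin{aligned} z_1 &= l_{11}x_1 + l_{12}x_2 + \cdots + l_{1m}x_m \\ z_2 &= l_{21}x_1 + l_{22}x_2 + \cdots + l_{2m}x_m \\ &\;\;\vdots \\ z_p &= l_{p1}x_1 + l_{p2}x_2 + \cdots + l_{pm}x_m \end{aligned}$$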

The coefficients lij are determined by the following principles:

(1) zi and zj (i ≠ j; i, j = 1, 2, …, p) are uncorrelated with each other;

(2) z1 has the largest variance among all linear combinations of x1, x2, …, xm whose coefficient vector has unit length; z2 has the largest variance among all such combinations that are uncorrelated with z1; and so on.

The composite variables z1, z2, …, zp determined in this way are called the 1st, 2nd, …, pth principal components of the original indicators. In the analysis, only the first few principal components, those with the largest variances, need to be selected.
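The largest-variance principle can be checked numerically in the simplest case of two standardized variables. The sketch below assumes a correlation of 0.6 between them and scans unit-length coefficient vectors (cos θ, sin θ); the variable names and the 0.1 degree grid are illustrative choices.

```python
import math

r = 0.6  # assumed correlation between two standardized variables x1 and x2

def variance_of_combination(theta, r):
    """Variance of z = cos(theta)*x1 + sin(theta)*x2 when x1, x2 have unit
    variance and correlation r: l1^2 + l2^2 + 2*r*l1*l2."""
    l1, l2 = math.cos(theta), math.sin(theta)
    return l1 * l1 + l2 * l2 + 2.0 * r * l1 * l2

# Scan directions from 0 to just under 180 degrees in 0.1 degree steps.
angles = [math.pi * k / 1800 for k in range(1800)]
best = max(angles, key=lambda t: variance_of_combination(t, r))
best_var = variance_of_combination(best, r)

# The second component must be uncorrelated with the first, which for two
# variables means its coefficient vector is perpendicular to the first one.
second_var = variance_of_combination(best + math.pi / 2, r)

print("direction of z1 (degrees):", math.degrees(best))
print("variance of z1:", best_var, " variance of z2:", second_var)
```

The two variances found by the scan, 1 + r and 1 - r, are the eigenvalues of the 2×2 correlation matrix, which is why the steps that follow work with eigenvalues and eigenvectors instead of a direct search.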

(B) Steps of principal component analysis

(1) Standardize the original data to eliminate differences in order of magnitude or units of measurement among the variables.

(2) Calculate the correlation coefficient matrix of the standardized data:

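$$R = (r_{ij})_{m \times m}, \qquad r_{ij} = \frac{\sum_{k=1}^{n}(x_{ki}-\bar{x}_i)(x_{kj}-\bar{x}_j)}{\sqrt{\sum_{k=1}^{n}(x_{ki}-\bar{x}_i)^2 \, \sum_{k=1}^{n}(x_{kj}-\bar{x}_j)^2}}$$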

(3) Find the eigenvalues λ1 ≥ λ2 ≥ … ≥ λm of the correlation coefficient matrix R and the corresponding eigenvectors αi = (αi1, αi2, …, αim), i = 1, 2, …, m, using the Jacobi method.
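A compact sketch of the Jacobi method for a symmetric matrix follows. It repeatedly picks the largest off-diagonal element and applies a plane rotation that sets it to zero, until the matrix is numerically diagonal. The function name, tolerance, and example matrix are assumptions made for illustration; the matrix must be symmetric and at least 2×2.

```python
import math

def jacobi_eigen(matrix, tol=1e-12, max_rotations=1000):
    """Return (eigenvalue, eigenvector) pairs of a symmetric matrix,
    sorted from the largest eigenvalue to the smallest."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _rotation in range(max_rotations):
        # Locate the largest off-diagonal element a[p][q].
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[p][q]) < tol:
            break
        # Rotation angle that zeroes a[p][q]: tan(2*theta) = 2*a_pq / (a_qq - a_pp).
        theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        # Column update M <- M J, applied to both A and the eigenvector matrix V.
        for mat in (a, v):
            for k in range(n):
                mkp, mkq = mat[k][p], mat[k][q]
                mat[k][p] = c * mkp - s * mkq
                mat[k][q] = s * mkp + c * mkq
        # Row update A <- J^T A completes the similarity transform J^T A J.
        for k in range(n):
            apk, aqk = a[p][k], a[q][k]
            a[p][k] = c * apk - s * aqk
            a[q][k] = s * apk + c * aqk
    # Eigenvalues are on the diagonal; eigenvectors are the columns of V.
    pairs = [(a[i][i], [v[k][i] for k in range(n)]) for i in range(n)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)

# Hypothetical correlation matrix of three variables.
R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.5],
     [0.3, 0.5, 1.0]]
pairs = jacobi_eigen(R)
for lam, vec in pairs:
    print(round(lam, 6), [round(x, 6) for x in vec])
```

The eigenvalues of a correlation matrix always sum to the number of variables, which gives a quick sanity check on the output.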

(4) Select the significant principal components and write their expressions.

Principal component analysis yields m principal components, but their variances, and hence the amount of information they carry, decrease from one component to the next. In practice, therefore, not all m components are retained; instead, the first K components are selected according to the cumulative contribution rate. The contribution rate of a principal component is the proportion of the total variance accounted for by that component, which equals the proportion of its eigenvalue in the sum of all eigenvalues. That is:

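$$\text{contribution rate of } z_i = \frac{\lambda_i}{\sum_{k}\lambda_k}, \qquad \text{cumulative contribution rate of the first } K \text{ components} = \frac{\sum_{i=1}^{K}\lambda_i}{\sum_{k}\lambda_k}$$

where the sums in the denominators run over all eigenvalues.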

The larger the contribution rate, the more of the original variables' information the principal component contains. The cumulative contribution rate determines the number K of principal components to select. To ensure that the composite variables capture the vast majority of the information in the original variables, the cumulative contribution rate is generally required to reach 85% or more.
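The selection rule can be written in a few lines. The eigenvalues below are hypothetical, and the function names are illustrative; the 85% threshold is the one stated above.

```python
def contribution_rates(eigenvalues):
    """Contribution rate of each component and the running cumulative rate,
    with the eigenvalues taken from largest to smallest."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    rates = [lam / total for lam in ordered]
    cumulative, running = [], 0.0
    for rate in rates:
        running += rate
        cumulative.append(running)
    return rates, cumulative

def select_k(eigenvalues, threshold=0.85):
    """Smallest K whose cumulative contribution rate reaches the threshold."""
    _rates, cumulative = contribution_rates(eigenvalues)
    for k, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return k
    return len(cumulative)

# Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to 4).
eigenvalues = [2.5, 1.1, 0.3, 0.1]
rates, cumulative = contribution_rates(eigenvalues)
K = select_k(eigenvalues)
print("contribution rates:", rates)
print("cumulative rates:", cumulative)
print("components to keep:", K)
```

Here the first component alone explains 62.5% of the variance and the first two together explain 90%, so K = 2.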

In practical applications, once the principal components have been selected, their real-world meaning also needs to be interpreted. Giving each principal component a reasonable interpretation is a critical issue in principal component analysis. In general, the interpretation is based on the coefficients in the principal component expressions, combined with qualitative analysis. A principal component is a linear combination of the original variables whose coefficients may be positive or negative, large or small, and sometimes of comparable size, so a principal component cannot simply be regarded as an attribute of any one original variable. The larger the absolute value of a variable's coefficient in the linear combination, the more that principal component reflects that variable. If several variables have coefficients of comparable size, the principal component should be regarded as a synthesis of those variables; what practical significance they share must be explained in light of the specific problem and domain knowledge. Only then can the analysis be accurate.

(5) Calculate the principal component scores. Substituting the standardized data of each sample into the principal component expressions gives the new data of each sample under each principal component, that is, the principal component scores. They take the following form:

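$$z_{ij} = \sum_{t} \alpha_{jt}\, x^{*}_{it}, \qquad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, K$$

where $x^{*}_{it}$ is the standardized value of variable t for sample i, $\alpha_{jt}$ is the tth element of the jth eigenvector, and the sum runs over all variables. The scores form an n×K matrix.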
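A short sketch of the score calculation for a toy case follows. The three samples are chosen so that both columns are already standardized (mean 0, sample variance 1) with correlation 0.5, which makes the eigenvectors known in closed form; all names and values are illustrative assumptions.

```python
def principal_component_scores(z, eigenvectors):
    """Score of sample i on component j: the sum over variables of
    z[i][t] * eigenvectors[j][t]."""
    return [[sum(z_it * a_jt for z_it, a_jt in zip(row, vec)) for vec in eigenvectors]
            for row in z]

# Standardized toy data: 3 samples, 2 variables.
z = [[-1.0, 0.0],
     [0.0, -1.0],
     [1.0, 1.0]]

# Eigenvectors of the correlation matrix [[1, 0.5], [0.5, 1]],
# belonging to the eigenvalues 1.5 and 0.5.
c = 2 ** -0.5
eigenvectors = [[c, c], [c, -c]]

scores = principal_component_scores(z, eigenvectors)
for row in scores:
    print([round(s, 4) for s in row])
```

The sample variance of each score column equals the corresponding eigenvalue, which ties the scores back to the contribution rates.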

(6) Further statistical analysis can then be carried out on the principal component scores. Common applications include principal component regression, selection of variable subsets, and comprehensive evaluation.

(C) Evaluation of Principal Component Analysis

Evaluating the benefits of reclamation with principal component analysis transforms many indicators into as few composite indicators as possible. The composite indicators are uncorrelated with one another, which reduces the overlap of information among the original indicators while retaining most of the information they contain. The method not only condenses multiple indicators into composite ones but also allows the factors influencing each principal component to be analyzed, so that the key factors affecting the whole evaluation system can be identified. In addition, principal component analysis determines the weights from the data themselves, which avoids the influence of subjective factors.
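One common convention for turning principal components into a single evaluation score, assumed here because the text does not specify a formula, is to weight each retained component's score by its share of the retained eigenvalues. The scores and eigenvalues below are hypothetical.

```python
def composite_scores(scores, eigenvalues):
    """Weighted sum of component scores. The weights are each retained
    eigenvalue divided by the sum of the retained eigenvalues (an assumed
    convention; other weightings are possible)."""
    total = sum(eigenvalues)
    weights = [lam / total for lam in eigenvalues]
    return [sum(w * s for w, s in zip(weights, row)) for row in scores]

# Hypothetical scores of 3 evaluation units on 2 retained components,
# with eigenvalues 1.5 and 0.5.
c = 2 ** -0.5
scores = [[-c, -c],
          [-c, c],
          [2 * c, 0.0]]
eigenvalues = [1.5, 0.5]

composite = composite_scores(scores, eigenvalues)
ranking = sorted(range(len(composite)), key=lambda i: composite[i], reverse=True)
print("composite scores:", composite)
print("units ranked best to worst (0-based):", ranking)
```

Because the sign of an eigenvector is arbitrary, the direction of each component should be checked against its interpretation before a ranking like this is used.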

It should be noted that although principal component analysis calculates the weight of each principal component quantitatively and so avoids human and subjective influences, the resulting weights may sometimes deviate from objective reality. Therefore, the best approach is to determine the weights with principal component analysis and then combine them with the weights given by different experts. In this way, qualitative analysis is carried out on a quantitative basis, and the two kinds of weights are considered together by means of appropriate mathematical methods.