### [High reward] How do you find the Pearson correlation coefficient matrix?

1 Methods

Property 1: Let X be a random variable with distribution function F(x); then Y = F(X) follows a uniform distribution on [0, 1].

Property 2: Let X1, X2, …, Xn be a simple random sample from a distribution with distribution function F(x). By Property 1, F(X1), F(X2), …, F(Xn) are, in a probabilistic sense, uniformly distributed on (0, 1). Sorting them from smallest to largest gives F(X(1)), F(X(2)), …, F(X(n)), where X(1), X(2), …, X(n) are the order statistics. The theoretical value of the i-th sorted value should be ri = (i − 0.5)/n, i = 1, 2, …, n, and the values of the inverse distribution function, F⁻¹(r1), F⁻¹(r2), …, F⁻¹(rn) (the chi-square scores in the case of the chi-square distribution), should be very close to X(1), X(2), …, X(n). Therefore, in a probabilistic sense, the scatter points (X(1), F⁻¹(r1)), (X(2), F⁻¹(r2)), …, (X(n), F⁻¹(rn)) should lie on a straight line.

According to Property 2, if X follows a normal distribution, the scatter points should theoretically fall on a straight line, and the Pearson coefficient can be used to characterize this. However, because of random variation the Pearson coefficient will not equal 1 exactly, so the lower limit, the 95% boundary value of the Pearson coefficient, is determined by stochastic simulation.

Property 3: From the conditional probability formula P(X, Y) = P(Y|X)P(X), a necessary and sufficient condition for (X, Y) to follow a bivariate normal distribution is that, for fixed X, Y follows a normal distribution (the conditional distribution) and the marginal distribution of X is normal. From the linear regression relation ε = Y − (α + βX), the conditional distribution of Y given X being normal is equivalent to the residual ε of the linear regression being normal. Hence: a necessary and sufficient condition for (X, Y) to follow a bivariate normal distribution is that the marginal distribution of X is normal and the residuals ε of the linear regression model Y = α + βX + ε follow a normal distribution.

Let X be drawn from a normal population. Sample from it 5000 times by random simulation, with sample sizes ranging from 7 to 50; for each sample, obtain F(x) from the ranks and compute the Pearson correlation coefficient between the ranked F(x) values and the ordered X values. Table 1 Boundary values of Pearson correlation coefficients obtained from 5000 random simulations for testing the normal distribution (omitted)
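As a sketch (the original computing environment is not specified), the boundary-value simulation can be reproduced in Python with NumPy/SciPy: for each simulated normal sample, correlate its order statistics with the theoretical normal quantiles at ri = (i − 0.5)/n, then take the 5% quantile of the 5000 simulated correlations as the lower boundary value.

```python
import numpy as np
from scipy.stats import norm

def normality_corr_bound(n, n_sim=5000, alpha=0.05, seed=0):
    """Simulate the lower boundary value of the Pearson correlation
    between ordered normal samples and theoretical normal quantiles."""
    rng = np.random.default_rng(seed)
    # Theoretical quantiles at r_i = (i - 0.5) / n
    q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    corrs = np.empty(n_sim)
    for k in range(n_sim):
        x = np.sort(rng.standard_normal(n))
        corrs[k] = np.corrcoef(x, q)[0, 1]
    # The alpha-quantile of the simulated correlations is the boundary value
    return np.quantile(corrs, alpha)

bound = normality_corr_bound(20)
print(round(bound, 4))  # lower 95% boundary value for n = 20
```

A sample of size 20 whose correlation with the normal quantiles falls below this bound would be judged non-normal at the 5% level.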

Similarly, the same method yields a table of boundary values of the Pearson correlation coefficient for testing the chi-square distribution (simplified table). Table 2 Table of correlation coefficient boundary values (omitted)

2 Validation by random simulation

2.1 Random simulation validation of the Pearson correlation coefficient boundary table

Let X be drawn from a normal population. Sample from it 5000 times by random simulation with sample sizes of 10, 20, 30, 40 and 50, compute the corresponding Pearson correlation coefficients and the proportion falling outside the boundary value, i.e. the rejection proportion; then, on the same batch of data, use the McNemar test to compare the difference between this method and the Shapiro-Wilk method. Table 3 (univariate normal distribution) Number of simulations (omitted) Table 4 (univariate skewed distribution, χ2) Number of simulations (omitted)

For this method, the rejection proportion at sample size 7 has a confidence interval of [78.37%, 94.12%], and at the remaining sample sizes it is close to 100%, which confirms that the boundary table is correct.
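Under the assumption that the boundary value for n = 20 is about 0.95 (Table 1 itself is omitted), the comparison with the Shapiro-Wilk test via McNemar's test might be sketched as follows; `crit` and the χ²(2) alternative are illustrative choices, not values from the original study.

```python
import numpy as np
from scipy.stats import shapiro, norm, chi2

rng = np.random.default_rng(0)
n, n_sim = 20, 500
q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
crit = 0.95  # assumed boundary value for n = 20 (Table 1 is omitted)

b = c = 0    # discordant pairs for McNemar's test
for _ in range(n_sim):
    x = rng.chisquare(2, size=n)  # skewed data: both tests should reject
    rej_corr = np.corrcoef(np.sort(x), q)[0, 1] < crit
    rej_sw = shapiro(x).pvalue < 0.05
    if rej_corr and not rej_sw:
        b += 1
    if rej_sw and not rej_corr:
        c += 1

# McNemar's test with continuity correction on the discordant pairs
stat = (abs(b - c) - 1) ** 2 / (b + c) if b + c else 0.0
p = chi2.sf(stat, df=1)
print(b, c, round(p, 4))
```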

2.2 Random simulation verification of the chi-square distribution boundary table

Table 5 chi-square distribution: simulation of 5000 times (omitted)

2.3 Random simulation verification of the Mahalanobis distance

By the definition of the Mahalanobis distance, draw samples of sizes 10, 20, 30, 40 and 50 from a normally distributed population, simulating 5000 samples for each size. Following the method above, compute the Pearson correlation coefficient between the ordered values X(1), X(2), …, X(n) and the chi-square scores, and from the correlation coefficient boundary table above calculate the corresponding statistic, i.e. the rejection proportion. Table 6 Proportion of Mahalanobis distances falling outside the Pearson coefficient boundary values (omitted)
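A minimal Python sketch of this check, assuming a standard bivariate normal population: the ordered squared Mahalanobis distances are correlated with chi-square scores with p degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, p = 30, 2
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Squared Mahalanobis distances from each point to the sample mean
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Chi-square scores at r_i = (i - 0.5)/n with p degrees of freedom
q = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
r = np.corrcoef(np.sort(d2), q)[0, 1]
print(round(r, 4))  # close to 1 for multivariate normal data
```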

2.4 Random simulation validation of bivariate normal distribution data

Construct a two-dimensional matrix A and find its eigenvalues P and eigenvectors Z. Let all the elements of X come from a normal population; then Y = Z′ × X must follow a bivariate normal distribution. Perform 5000 random simulations and verify the rejection proportion with the method introduced in Property 3. Table 7 (Bivariate normal distribution) Number of simulations (omitted) Table 8 (Bivariate skewed distribution, χ2) Number of simulations (omitted)
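The transformation Y = Z′X and the Property 3 check (marginal normality of one component plus normality of the regression residuals) can be sketched as follows; the symmetric matrix A here is an arbitrary example, not from the original study.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)
n = 50
X = rng.standard_normal((2, n))          # two independent normal rows
A = np.array([[2.0, 1.0], [1.0, 3.0]])   # arbitrary symmetric matrix
P, Z = np.linalg.eigh(A)                 # eigenvalues P, eigenvectors Z
Y = Z.T @ X                              # Y = Z'X is still bivariate normal

# Property 3 check: the marginal of Y[0] is normal, and the residuals
# of the regression of Y[1] on Y[0] are normal
x, y = Y[0], Y[1]
beta, alpha = np.polyfit(x, y, 1)        # slope, intercept
resid = y - (alpha + beta * x)
p_x = shapiro(x).pvalue
p_res = shapiro(resid).pvalue
print(round(p_x, 4), round(p_res, 4))    # both should be non-small
```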

2.5 Random simulation verification of trivariate normal distribution data

Similarly, 5000 random simulations with the same method give the rejection proportion for trivariate normal distribution data. Table 9 (Trivariate normal distribution) Number of simulations: 5000

### How to calculate the eigenvalues, contribution rates and cumulative contribution rate of the correlation coefficient matrix of a set of data with SPSS

Any SPSS textbook covers this clearly. You can get it through factor analysis: in SPSS, go to Analyze → Data Reduction → Factor.

### How to do orthogonal analysis with SPSS

This can be done in SPSSAU:

1. For example, to generate an orthogonal table for three factors at three levels with interactions, set the number of factors to 3 and the number of levels to 3, then click "Start analysis".

2. After the experiment is completed, you can use ANOVA for further analysis.

### How to obtain the eigenvector matrix in SPSS

1. First, find the eigenvalues of the correlation coefficient matrix and the corresponding eigenvectors.

2. Next, arrange the eigenvectors into a matrix, row by row from top to bottom, in descending order of the corresponding eigenvalues.

3. Finally, select components up to a cumulative contribution rate of 85%.
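The same three steps can be sketched outside SPSS, e.g. in Python; the data matrix here is an illustrative stand-in.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.standard_normal((100, 4))   # illustrative 100 x 4 data matrix

# Step 1: eigenvalues and eigenvectors of the correlation matrix
R = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)

# Step 2: sort in descending order of eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: contribution rates and the 85% cumulative cutoff
contrib = eigvals / eigvals.sum()
cum_contrib = np.cumsum(contrib)
k = int(np.searchsorted(cum_contrib, 0.85)) + 1
print(np.round(cum_contrib, 3), k)
```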

### Self-study Notes for R in Action 71-Principal Component and Factor Analysis

Principal Component Analysis

Principal Component Analysis (PCA) is a data dimensionality-reduction technique that transforms a large number of correlated variables into a small set of uncorrelated variables called principal components (linear combinations of the original variables). The overall idea is to simplify the problem and capture its essence, which is the idea of dimensionality reduction.

PCA analyzes data through appropriate mathematical transformations that make the new variables, the principal components, linear combinations of the original variables, and then selects the few principal components that account for a large share of the total variance. The larger a principal component's share of the total variance, the greater its role in a comprehensive evaluation.

Factor Analysis

Exploratory Factor Analysis (EFA) is a family of methods used to discover the underlying structure of a set of variables. It explains the observed, explicit relationships between variables by looking for a smaller set of underlying or hidden structures.

Differences Between PCA and EFA Models

See Figure 14-1. The principal components (PC1 and PC2) are linear combinations of the observed variables (X1 through X5). The weights that form the linear combinations are obtained by maximizing the variance explained by each principal component, while also ensuring that the individual principal components are uncorrelated. In contrast, the factors (F1 and F2) are treated as structural bases or “causes” of the observed variables, rather than as linear combinations of them.

The base installation package of R provides functions for PCA and EFA, princomp() and factanal(), respectively.

Most common analysis steps

(1) Data preprocessing. Both PCA and EFA derive their results from correlations between the observed variables. You can pass either the raw data matrix or the correlation coefficient matrix to the principal() and fa() functions. If you supply raw data, the correlation coefficient matrix is computed automatically; make sure there are no missing values in the data beforehand.

(2) Select a factor model. Determine whether PCA (data dimensionality reduction) or EFA (discovery of underlying structure) better fits your research goals. If you choose EFA, you will also need to choose a method for estimating the factor model (e.g., maximum likelihood estimation).

(3) Determine the number of principal components/factors to choose.

(4) Select the principal components/factors.

(5) Rotate the principal components/factors.

(6) Interpret the results.

(7) Calculate principal component or factor scores.

The goal of PCA is to replace a large number of correlated variables with a smaller set of uncorrelated variables while retaining as much information as possible about the initial variables. These derived variables, called principal components, are linear combinations of the observed variables. For example, the first principal component is

PC1 = a1X1 + a2X2 + … + akXk

It is the weighted combination of the k observed variables that explains the most variance in the initial variable set. The second principal component, also a linear combination of the initial variables, explains the second-largest amount of variance and is orthogonal (uncorrelated) to the first principal component. Each subsequent principal component maximizes the variance it explains while remaining orthogonal to all previous principal components. In theory you could extract as many principal components as there are variables, but in practice we want to approximate the full set of variables with fewer principal components.

Relationship between principal components and original variables

(1) The principal components retain the vast majority of information about the original variables.

(2) The number of principal components is considerably less than the number of original variables.

(3) The principal components are uncorrelated with each other.

(4) Each principal component is a linear combination of the original variables.
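These properties can be verified numerically; a small sketch with synthetic correlated data (the mixing matrix is arbitrary) shows that the components are linear combinations of the standardized variables and are mutually uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(5)
mix = np.array([[1.0, 0.5, 0.2],
                [0.0, 1.0, 0.4],
                [0.0, 0.0, 1.0]])
X = rng.standard_normal((100, 3)) @ mix   # correlated variables

Xs = (X - X.mean(0)) / X.std(0)           # standardize
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order]                     # weights of the linear combinations
scores = Xs @ W                           # the principal components

# Property (3): the components are mutually uncorrelated
C = np.corrcoef(scores, rowvar=False)
print(np.round(C, 6))
```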

The dataset USJudgeRatings contains attorney ratings of U.S. Superior Court judges. The data frame contains 43 observations and 12 variables.

Guidelines for deciding how many principal components are needed in PCA:

determine the number of principal components from prior experience and theoretical knowledge;

determine the number needed from a threshold on the cumulative proportion of variance to be explained;

judge the number to retain by examining the k × k correlation coefficient matrix of the variables.

The most common is the eigenvalue-based approach. Each principal component is associated with the eigenvalues of the correlation coefficient matrix, with the first principal component associated with the largest eigenvalue, the second principal component with the second largest, and so on.

The Kaiser-Harris criterion suggests retaining principal components with eigenvalues greater than 1; components with eigenvalues less than 1 explain less variance than a single variable contains. Cattell's scree test, on the other hand, plots the eigenvalues against the component numbers. Such a plot clearly shows where the curve bends, and the principal components above the sharpest bend (the elbow) can be retained. Finally, you can run a simulation to determine the eigenvalues to extract, based on random data matrices of the same size as the initial matrix: if an eigenvalue based on the real data is larger than the average of the corresponding eigenvalues from the random data matrices, that principal component is retained. This method is called parallel analysis.
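Parallel analysis as described can be sketched in a few lines; this is a generic reimplementation, not the internals of fa.parallel(), and the example data set with two underlying dimensions is synthetic.

```python
import numpy as np

def parallel_analysis(data, n_iter=100, seed=0):
    """Count eigenvalues of the real correlation matrix that exceed the
    average eigenvalues of same-sized random-data correlation matrices."""
    rng = np.random.default_rng(seed)
    n, k = data.shape
    real = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand = np.zeros(k)
    for _ in range(n_iter):
        noise = rng.standard_normal((n, k))
        rand += np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    return int(np.sum(real > rand / n_iter))

# Synthetic data with two underlying dimensions in six variables
rng = np.random.default_rng(4)
f = rng.standard_normal((200, 2))
data = f @ rng.standard_normal((2, 6)) + 0.5 * rng.standard_normal((200, 6))
n_keep = parallel_analysis(data)
print(n_keep)
```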

Graph interpretation: the line with x symbols (blue) is the observed eigenvalue curve;

the red dashed line is the average eigenvalue curve derived from 100 random data matrices;

the green solid line is the eigenvalue criterion line (i.e. the horizontal line y = 1).

Decision criterion: a component whose eigenvalue is larger than both the average random eigenvalue and the y = 1 criterion line is retained. By this criterion, one principal component can be retained.

fa.parallel function learning

fa.parallel(data, n.obs=, fa="pc"/"fa"/"both", n.iter=100, show.legend=TRUE/FALSE)

data: the raw-data data frame;

n.obs: the number of observations (not variables) when data is a correlation coefficient matrix; ignored when data is a raw data matrix;

fa: "pc" computes principal components only, "fa" does factor analysis, "both" computes both;

n.iter: the number of simulations for the parallel analysis;

show.legend: whether to show the legend.

principal(r, nfactors=, rotate=, scores=)

r: the correlation coefficient matrix or the raw data matrix;

nfactors: the number of principal components (default 1);

rotate: the rotation method (default is varimax, maximum-variance rotation);

scores: whether to compute principal component scores (default FALSE).

The PC1 column contains the component loadings, the correlation coefficients of the observed variables with the principal component. If more than one principal component is extracted, there will also be columns PC2, PC3, and so on. Component loadings can be used to interpret the meaning of the principal components and to gauge how strongly each variable relates to them.

Column h2 gives the component communalities, the proportion of each variable's variance explained by the principal components.

Column u2 gives the component uniquenesses, the proportion of variance that the principal components cannot explain (1 − h2).

The SS loadings row contains the eigenvalues associated with the principal components, i.e. the standardized variance associated with each component. It can be used to see how many components are needed to explain, say, 90% of the variance (set nfactors to the number of original variables to obtain all the eigenvalues, or apply the eigen() function directly to the correlation matrix).

Proportion Var indicates how much of the whole data set each principal component explains.

Cumulative Var is the running sum of the components' explained proportions.

Proportion Explained and Cumulative Proportion are, respectively, each component's share and cumulative share of the total variance actually explained.

Results Interpretation: the first principal component (PC1) is highly correlated with every variable, i.e. it is a dimension usable for overall evaluation. 99.1% of the variance of the ORAL variable is explained by PC1, and only 0.9% cannot be explained by it. The first principal component explains 92% of the variance of the 11 variables.

Results Interpretation: from the scree plot, the number of principal components can be set to 2.

Results Interpretation: from the Proportion Var results, 0.58 and 0.22, the first principal component explains 58% of the variance in the body measurements and the second explains 22%; together they explain 81% of the variance. For the height variable, the two together explain 88% of its variance.

Rotation is a family of mathematical techniques that make the component loading matrix easier to interpret by purifying the components as much as possible. Rotation methods come in two kinds: those that keep the selected components uncorrelated (orthogonal rotation) and those that allow them to become correlated (oblique rotation). Methods also differ in their definition of purification. The most popular orthogonal rotation is varimax, which tries to purify the columns of the loading matrix so that each component is explained by only a limited set of variables (i.e. each column has only a few large loadings, while the rest are very small). The column names in the result change from PC to RC to indicate that the components have been rotated.
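A generic varimax implementation (the standard SVD-based algorithm, not the psych package code) illustrates that an orthogonal rotation only redistributes loadings; the loading matrix here is made up for the example.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a loading matrix (SVD algorithm)."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Varimax criterion gradient target
        target = L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0))
        u, s, vh = np.linalg.svd(loadings.T @ target)
        R = u @ vh
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return loadings @ R

L = np.array([[0.7, 0.4],
              [0.8, 0.3],
              [0.3, 0.7],
              [0.2, 0.8]])   # hypothetical unrotated loadings
rot = varimax(L)
# Rotation is orthogonal, so each variable's communality is unchanged
print(np.round((L ** 2).sum(1), 4), np.round((rot ** 2).sum(1), 4))
```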

When scores=TRUE, the principal component scores are stored in the scores element of the object returned by the principal() function.

Factor analysis can be used if your goal is to seek potential hidden variables that explain the observed variables.

The goal of EFA is to explain the correlations within a group of observable variables by uncovering a smaller, more fundamental set of unobservable variables hidden beneath the data. These virtual, unobservable variables are called factors. (Each factor is thought to explain the variance shared by more than one observable variable, so strictly they should be called common factors.)

The model is Xi = ai1F1 + ai2F2 + … + aipFp + Ui, where Xi is the i-th observable variable (i = 1…k), the Fj are the common factors (j = 1…p), and p < k. Ui is the part of Xi unique to that variable (not explainable by the common factors). The coefficients aij can be thought of as each factor's contribution to the observable variable composited from them.

In the scree test, the first two eigenvalues (triangles) lie above the bend and are greater than the mean eigenvalues of 100 simulated data matrices. Note that for EFA, the Kaiser-Harris criterion retains factors with eigenvalues greater than 0, not 1.

Results Interpretation: the PCA results suggest extracting one or two components, and EFA suggests extracting two factors.

fa(r, nfactors=, n.obs=, rotate=, scores=, fm=)

r: the correlation coefficient matrix or the raw data matrix;

nfactors: the number of factors to extract (default 1);

n.obs: the number of observations (required when a correlation coefficient matrix is input);

rotate: the rotation method (default is oblimin);

scores: whether to compute factor scores (default FALSE);

fm: the factoring method (default is minres, the minimum residual method).

Unlike PCA, there are many methods for extracting common factors, including maximum likelihood (ml), principal axis iteration (pa), weighted least squares (wls), generalized weighted least squares (gls), and minimal residuals (minres). Statisticians favor the maximum likelihood method because of its good statistical properties.

Results Interpretation: the Proportion Var of the two factors was 0.46 and 0.14, respectively; together the two factors explain 60% of the variance of the six psychological tests.

Results Interpretation: reading and vocabulary loaded more heavily on the first factor, drawing, block patterns, and mazes loaded more heavily on the second factor, and nonverbal measures of general intelligence loaded more evenly on the two factors, suggesting the presence of a verbal intelligence factor and a nonverbal intelligence factor.

Differences between orthogonal and oblique rotations.

For orthogonal rotation, factor analysis focuses on the factor structure matrix (the correlation coefficients of the variables with the factors), whereas for oblique rotation, factor analysis takes into account three matrices: the factor structure matrix, the factor pattern matrix, and the factor correlation matrix.

The factor pattern matrix is the standardized regression coefficient matrix. It lists the weights of the factor predictor variables. The factor correlation matrix is the matrix of factor correlation coefficients.
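For oblique rotations, the three matrices are linked by the identity Structure = Pattern × Φ, where Φ is the factor correlation matrix; a toy numerical check with made-up values:

```python
import numpy as np

# Hypothetical factor pattern matrix (standardized regression weights)
pattern = np.array([[0.8, 0.1],
                    [0.7, 0.0],
                    [0.1, 0.9],
                    [0.0, 0.8]])
# Hypothetical factor correlation matrix (oblique factors correlate 0.3)
phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])

# Structure matrix = pattern matrix @ factor correlation matrix
structure = pattern @ phi
print(np.round(structure, 3))
```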

Graphic interpretation: vocabulary and reading loaded more on the first factor (PA1), while block patterns, drawing and mazes loaded more on the second factor (PA2). The General Intelligence Test was more even on both factors.

Unlike principal component scores, which can be computed exactly, factor scores can only be estimated. They can be estimated in several ways; the fa() function uses a regression method.

R includes a number of other packages that are useful for factor analysis. The FactoMineR package provides not only PCA and EFA but also latent variable models; it has many parameter options not considered here, such as handling both numeric and categorical variables. The FAiR package estimates factor analysis models with genetic algorithms, enhancing model parameter estimation and supporting inequality constraints. The GPArotation package offers many factor rotation methods. Finally, the nFactors package provides many sophisticated methods for determining the number of factors.

Principal Component Analysis

1. Data Import